We say that two probabilities are similar at level $\alpha$ if they are contaminated versions (up to
an $\alpha$ fraction) of the same common probability. We show how this model is related to minimal
distances between sets of trimmed probabilities. Empirical versions turn out to present an over- 
fitting effect in the sense that trimming beyond the similarity level results in trimmed samples 
that are closer than expected to each other. We show how this can be combined with a bootstrap 
approach to assess similarity from two data samples. 

Keywords: asymptotics; bootstrap; consistency; mass transportation problem; over-fitting; 
robustness; similarity of distributions; trimmed probability; Wasserstein distance 

1. Similarity vs. homogeneity 

Classical goodness of fit deals with the problem of assessing whether the unknown random 
generator, P, of a data object, X, belongs to a given class, F. This includes two-sample 
problems in which two different random objects are observed. We focus on checking 
whether a certain feature of the corresponding random generators coincides. The case 
in which $X_1$ is a collection of i.i.d. random variables $X^1_1,\ldots,X^1_n$ with common
distribution $P_1$, $X_2$ is another sequence of i.i.d. random variables $X^2_1,\ldots,X^2_m$ with law $P_2$
and the goal is to assess whether $\theta(P_1)=\theta(P_2)$ for some function $\theta(\cdot)$ (including, for
instance, $\theta(P)=P$) is a homogeneity problem, to which a large amount of literature has
been devoted. Our starting point is that it is often the case that the researcher is not
really interested in checking whether $P\in\mathcal{F}$ or whether $P_1=P_2$. Imagine the case of
a pharmaceutical company trying to introduce a new (and cheaper) alternative to some 
reference drug. The regulatory authorities will approve the new drug if its performance 
with respect to a certain biological magnitude does not differ from that of the standard 
drug. Both drugs could produce a similar outcome on most patients. However, if there 
is a fraction of them for whom the results are clearly different, then the new drug is 
very likely to be rejected by a homogeneity test, while, in fact, it has a similar performance
for most individuals. As another example, consider the comparison of two human
populations that were initially equal but have received immigration with different pat- 
terns. In these situations the relevant assumption to check is not homogeneity, but rather 
similarity in the following sense. 

Definition 1. Two probability measures $P_1$ and $P_2$ on the same sample space are $\alpha$-similar
if there exist probability measures $P_0, P_1', P_2'$ such that

$$\begin{cases} P_1=(1-\varepsilon_1)P_0+\varepsilon_1P_1', \\ P_2=(1-\varepsilon_2)P_0+\varepsilon_2P_2', \end{cases}\tag{1}$$

with $0\le\varepsilon_i\le\alpha$, $i=1,2$.

Definition 1 measures the overlap between $P_1$ and $P_2$, in agreement with other possible
measures of similarity (see the section "Similarity between Populations" in [14]). Beware
that smaller values of $\alpha$ in Definition 1 correspond to more similar distributions (the case
$\alpha=0$ being equivalent to $P_1=P_2$).
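
As a concrete instance of Definition 1, the following worked example uses the pair of distributions that the paper later takes for Figure 1. The particular choice of $P_0$ and of the contaminations below is one valid decomposition, given here only as an illustration; it is not claimed to be the only one.

```latex
% Worked instance of Definition 1 with \alpha = 0.2.
P_1 = N(0,1), \qquad P_2 = 0.8\,N(0,1) + 0.2\,N(4,1).
% Take P_0 = N(0,1), \varepsilon_1 = 0, \varepsilon_2 = 0.2, P_2' = N(4,1), and any P_1':
P_1 = (1-0)\,P_0 + 0\cdot P_1', \qquad
P_2 = (1-0.2)\,P_0 + 0.2\,P_2'.
% Both \varepsilon_i lie in [0, 0.2], so P_1 and P_2 are 0.2-similar.
```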

A related situation, for one-sample problems, would be the case when we observe some
random object $X$ with law $P_1$. Ideally, $P_1$ should equal $P_0$ (some gold standard), but the
presence of noise means that, in fact,

$$P_1=(1-\varepsilon)P_0+\varepsilon N,\qquad \varepsilon\le\alpha\tag{2}$$

for some unspecified $N$ if we assume that the noise level does not exceed $\alpha$. We would
say that $P_1$ is similar to $P_0$ at level $\alpha$ if (2) holds (observe that $P_1$ and $P_0$ do not
play a symmetric role in this definition). In two-sample problems, we want to assess
whether the two samples can be assumed to be noisy realizations of some unknown gold
standard, as in Definition 1. Model (2) corresponds to the 'contamination neighborhoods'
introduced in Huber [15, 16] in a robust testing setup. We discuss further connections
to these and other related references in Section 2.2 below. Our goal in this work is to
present a method for assessing similarity of the unknown random generators $P_1$, $P_2$ of
two independent i.i.d. samples. Our procedure also yields an estimate of the common 
core of the two distributions. 

Our approach is based on trimming. Trimming procedures are of frequent use in robust 
statistics as a way of downplaying the influence of contaminating data in our inferences. 
The introduction of data-dependent versions of trimming, often called impartial trim- 
ming, allows us to overcome some limitations of earlier versions of trimming that simply 
removed extreme observations at tails. Generally, impartial trimming is based on some 
optimization criterion, keeping the fraction of the sample (of a prescribed size) that yields 
the least possible deviation with respect to a theoretical model. Today, impartial trim- 
ming constitutes one of the main tools in the robust approach to a variety of statistical 
settings (see [9, 12, 18, 23]). The first approach to model validation based on impar- 
tial trimming is (to the best of our knowledge) the one in Alvarez-Esteban et al. [1, 3]. 
The problem considered there can be rephrased as follows. Given two independent i.i.d. 
samples of univariate data with unknown random generators $P_1$, $P_2$, we want to assess
whether $P_i=\mathcal{L}(\varphi_i(Z))$, $i=1,2$, for some random variable $Z$ defined on a probability
space $(\Omega,\mathcal{F},\mathbb{P})$ and non-decreasing functions, $\varphi_1$, $\varphi_2$, such that

$$\mathbb{P}(\varphi_1(Z)\ne\varphi_2(Z))\le\alpha$$

(see Section 2.2 for further discussion). Despite the interest of this approach, we be- 
lieve that the similarity model given by Definition 1 is often more natural and useful in 
applications. Some technically related results and the connection with the optimal trans- 
portation problem have been reported in Alvarez-Esteban et al. [2]. A related approach 
based on density estimation can be found in Martínez-Camblor et al. [19].

As we will show in Section 2, the similarity model of Definition 1 can be expressed
in terms of a minimal distance between the sets of trimmings of the probabilities $P_i$,
$i=1,2$. These are the sets of probabilities that one obtains from a fixed one by removing
or downplaying (to some degree) the weight assigned by the original probability. When
we look for the minimal distance between trimmings of the empirical measures based on
two samples, we are highlighting the part of the data that, hopefully, comes from the
common core $P_0$. From a descriptive point of view, this gives an interesting tool for the
comparison of data samples. 

A distinctive feature of our proposal concerns the rates of convergence. If $P_n$, $Q_n$
are the empirical distributions based on two samples of univariate data (of equal size
for simplicity), we will trim up to an $\alpha$-fraction of data from both samples in order to
minimize some distance, $d(\cdot,\cdot)$; and if we write $P_{n,\alpha}$, $Q_{n,\alpha}$ for the optimally trimmed
empirical distributions, we will have $d(P_{n,\alpha},Q_{n,\alpha})\le d(P_n,Q_n)$. Trimming procedures
generally give a balanced compromise between efficiency and robustness, and increasing
the level of trimming has a moderate effect on the efficiency. Thus, for univariate i.i.d.
data coming from equal random generators, we typically have $d(P_n,Q_n)=O_P(n^{-1/2})$ and
$d(P_{n,\alpha},Q_{n,\alpha})=O_P(n^{-1/2})$, but it is not true that $d(P_{n,\alpha},Q_{n,\alpha})=o_P(n^{-1/2})$ (see, e.g.,
Theorem A.1 in [1]). However, for our procedure, over-trimming (i.e., trimming beyond
the similarity level) will produce an over-fitting effect, namely, $d(P_{n,\alpha},Q_{n,\alpha})=o_P(n^{-1/2})$.
That will be the key for the statistical application of the procedure. Roughly speaking, if
two random samples are trimmed more than required to delete contamination, then two
samples far more similar than expected are obtained and it is feasible to distinguish this
pair of trimmed samples from any other pair of non-trimmed, non-contaminated samples.
We formalize this idea in Section 2. As in Alvarez-Esteban et al. [1], our choice for the
metric $d$ is the $L_2$ Wasserstein distance.

This over-fitting effect can be combined with a bootstrap procedure to consistently 
decide if the underlying distributions of two i.i.d. samples are similar in the sense of 
Definition 1 as we show in Section 3. This statistical procedure should also be useful in 
other frameworks of model validation. The consistency of our procedure is independent 
of the kind of contaminations. However, as expected, inliers are harder to detect than
outliers. In this proposal, we have to consider small resampling sizes in the presence of 
inliers. This is discussed in Section 4, where we present some simulations showing the 
performance of our bootstrap procedure over finite samples. We also include the analysis 
of a real data set. 

For the sake of readability we have moved most of the proofs to an Appendix, together
with some additional results on rates of convergence. 

Throughout the paper $\mathcal{P}$ will be the set of Borel probability measures on the real
line, $\mathbb{R}$, while $\mathcal{F}_p$ will denote the set of distributions in $\mathcal{P}$ with finite $p$th absolute moment.
If $F$ is a distribution function, $F^{-1}$ will denote its generalized inverse or quantile function.
Given $P,Q\in\mathcal{P}$, by $P\ll Q$ we will denote absolute continuity of $P$ with respect to $Q$,
and by $\frac{dP}{dQ}$ the corresponding Radon-Nikodym derivative. Unless otherwise stated, the
random variables will be assumed to be defined on the same probability space $(\Omega,\sigma,\nu)$.
Weak convergence of probabilities will be denoted by $\to_w$ and $\mathcal{L}(X)$ (resp., $EX$) will
denote the law (resp., the mean) of the variable $X$. The indicator function of a set $A$ will
be $I_A$ and $\ell$ will denote the Lebesgue measure.



2. Trimming and over-fitting 
2.1. Trimmings of a distribution 

Trimming an $\alpha$-fraction of data in a sample of size $n$ can be understood as replacing the
empirical measure by a new one in which the data are reweighted so that the trimmed
points now have zero probability while the remaining points will have weight $1/(n(1-\alpha))$.
By analogy we can define the trimming of a distribution as follows.

Definition 2. Given $\alpha\in(0,1)$, we define the set of $\alpha$-trimmed versions of $P$ by

$$\mathcal{R}_\alpha(P):=\left\{Q\in\mathcal{P}:\ Q\ll P,\ \frac{dQ}{dP}\le\frac{1}{1-\alpha}\ \ P\text{-a.s.}\right\}.\tag{3}$$

This definition has been considered by several authors (see [1, 7, 13]). It allows the
consideration of partial removal of the points in the support of the probability. This
flexibility results in nice properties of the sets of trimmings, making $\mathcal{R}_\alpha(P)$ a convex set,
compact for the topology of weak convergence (see Proposition 2.1 in [2]).
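
For an empirical measure on $n$ points, Definition 2 reduces to a constraint on the weight vector: the weights are nonnegative, add up to one, and none exceeds $1/(n(1-\alpha))$. The following minimal Python sketch checks that constraint. The function names `is_alpha_trimming` and `hard_trimming_weights` and the tolerance `tol` are illustrative assumptions, not notation from the paper.

```python
def is_alpha_trimming(weights, alpha, tol=1e-12):
    """Return True if `weights` is an alpha-trimmed version of the empirical
    measure on len(weights) points: nonnegative weights that sum to one,
    each bounded by 1 / (n * (1 - alpha))."""
    n = len(weights)
    cap = 1.0 / (n * (1.0 - alpha))
    return (
        all(w >= -tol for w in weights)
        and abs(sum(weights) - 1.0) <= tol
        and all(w <= cap + tol for w in weights)
    )


def hard_trimming_weights(n, removed):
    """Weights obtained by deleting the indices in `removed` and spreading
    the mass uniformly over the remaining points (no partial trimming)."""
    kept = n - len(removed)
    return [0.0 if i in removed else 1.0 / kept for i in range(n)]
```

Deleting $k$ points yields an $\alpha$-trimming exactly when $k\le n\alpha$; partial trimming corresponds to weights strictly between zero and the cap.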

In this paper we use the quadratic Wasserstein distance, $W_2$, namely, the minimal
quadratic transportation cost between probabilities with finite second moment. $W_2$ metrizes
weak convergence plus convergence of second moments. We refer the reader to
Section 8 of Bickel and Freedman [4] for further details on $W_2$. On the real line $W_2$ is
simply the $L_2$ distance between quantile functions, that is,
$W_2^2(P_1,P_2)=\int_0^1(F_1^{-1}(t)-F_2^{-1}(t))^2\,dt$ if $F_i^{-1}$ is the quantile function of $P_i$.
Trimmings are also well behaved with
respect to $W_2$, as shown in Alvarez-Esteban et al. [2]. For instance, for $P\in\mathcal{F}_2$, $\mathcal{R}_\alpha(P)$
is a compact subset of $\mathcal{F}_2$ for $W_2$ (see Proposition 2.8 in [2]). A simple consequence is
that in

$$W_2(\mathcal{R}_\alpha(P_1),\mathcal{R}_\alpha(P_2)):=\min_{R_i\in\mathcal{R}_\alpha(P_i),\ i=1,2}W_2(R_1,R_2)\tag{4}$$

the minimum is indeed attained. A remarkable result is that the minimizer is unique under
mild assumptions. This is Theorem 2.16 in Alvarez-Esteban et al. [2], which generalizes
related results in Caffarelli and McCann [6] and Figalli [11].
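
The quantile-function formula for $W_2$ on the real line is easy to evaluate for discrete measures, since both quantile functions are step functions. The sketch below is a minimal illustration under that assumption; the function names `w2_discrete` and `w2_empirical` are hypothetical, and the code computes only the distance between two given weighted measures (for instance two given trimmings), not the minimization in (4).

```python
import math


def w2_discrete(xs, wx, ys, wy):
    """W2 between two weighted discrete measures on the real line, computed
    as the L2 distance between their (piecewise constant) quantile functions.
    Both weight vectors are assumed nonnegative with total mass one."""
    px = [(x, w) for x, w in sorted(zip(xs, wx)) if w > 0]
    py = [(y, w) for y, w in sorted(zip(ys, wy)) if w > 0]
    i = j = 0
    rem_x, rem_y = px[0][1], py[0][1]  # mass still unused at the current atoms
    total = 0.0
    while True:
        step = min(rem_x, rem_y)       # length of the next quantile interval
        total += step * (px[i][0] - py[j][0]) ** 2
        rem_x -= step
        rem_y -= step
        if rem_x <= 1e-15:             # current atom of the first measure used up
            i += 1
            if i == len(px):
                break
            rem_x = px[i][1]
        if rem_y <= 1e-15:             # current atom of the second measure used up
            j += 1
            if j == len(py):
                break
            rem_y = py[j][1]
    return math.sqrt(total)


def w2_empirical(x, y):
    """W2 between the empirical measures of two samples (sizes may differ)."""
    return w2_discrete(x, [1.0 / len(x)] * len(x), y, [1.0 / len(y)] * len(y))
```

For two samples of equal size this reduces to the root mean squared difference of the order statistics.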

Proposition 1. If $P_1,P_2\in\mathcal{F}_2$, $0\le\alpha<1$ and $P_1$ or $P_2$ has a density, then there exists
a unique pair $(P_{1,\alpha},P_{2,\alpha})\in\mathcal{R}_\alpha(P_1)\times\mathcal{R}_\alpha(P_2)$ such that

$$W_2(P_{1,\alpha},P_{2,\alpha})=W_2(\mathcal{R}_\alpha(P_1),\mathcal{R}_\alpha(P_2)),$$

provided $W_2(\mathcal{R}_\alpha(P_1),\mathcal{R}_\alpha(P_2))>0$.

The connection between trimmings and the similarity model of Definition 1 is given by
the next result. Here $d_{TV}$ denotes the distance in total variation, namely,
$d_{TV}(P_1,P_2)=\sup_B|P_1(B)-P_2(B)|$, where $B$ ranges among all Borel sets.

Proposition 2. For $\alpha\in[0,1)$ the following are equivalent:

(a) $P_1$ and $P_2$ are $\alpha$-similar.

(b) $\mathcal{R}_\alpha(P_1)\cap\mathcal{R}_\alpha(P_2)\ne\emptyset$.

(c) $d_{TV}(P_1,P_2)\le\alpha$.

If $P_1,P_2\in\mathcal{F}_2$, then (a), (b) or (c) is equivalent to

(d) $W_2(\mathcal{R}_\alpha(P_1),\mathcal{R}_\alpha(P_2))=0$.

Finally, the common core distribution, $P_0$, in Definition 1 is unique if and only if
$d_{TV}(P_1,P_2)=\alpha$. In this case, $P_0$ is given by the density $f_0=(f_1\wedge f_2)/(1-\alpha)$ with respect
to $\mu$ if $\mu$ is a common $\sigma$-finite dominating measure for $P_1$ and $P_2$ and $f_1$ and $f_2$ are the
corresponding densities and we have the canonical decomposition $P_i=(1-\alpha)P_0+\alpha P_i'$,
$i=1,2$, $P_i'$ having density $\frac{1}{\alpha}(f_i-f_1\wedge f_2)$ with respect to $\mu$.

Proof. If (a) holds, then $P_0(A)\le\frac{1}{1-\alpha}P_i(A)$ for all Borel $A$. In particular, $P_0\ll P_i$ and,
if $A_i=\{\frac{dP_0}{dP_i}>(1-\alpha)^{-1}\}$, obviously $P_i(A_i)=0$ and $P_0\in\mathcal{R}_\alpha(P_1)\cap\mathcal{R}_\alpha(P_2)$, showing (b).
Assume now (b) and take $P_0\in\mathcal{R}_\alpha(P_1)\cap\mathcal{R}_\alpha(P_2)$. Then $(1-\alpha)P_0(A)\le P_i(A)$ for all $A$.
If $\alpha=0$, then (c) holds trivially. Otherwise define $P_i'(A)=(P_i(A)-(1-\alpha)P_0(A))/\alpha$.
Then $P_i'$ is a probability and $d_{TV}(P_1,P_2)=\alpha\,d_{TV}(P_1',P_2')\le\alpha$, that is, (c) holds. Finally,
we assume that (c) holds and take $\mu$ to be a common $\sigma$-finite dominating measure for $P_1$
and $P_2$ and write $f_1$ and $f_2$ for the corresponding densities. Then (see Lemma 2.20 in [20])
$d_{TV}(P_1,P_2)=1-\int(f_1\wedge f_2)\,d\mu$ (where $a\wedge b$ means $\min(a,b)$). Write $\varepsilon=d_{TV}(P_1,P_2)$
and assume $\varepsilon>0$ (the case $\varepsilon=0$ is trivial). We set $f_i'=(f_i-f_1\wedge f_2)/\varepsilon$, $i=1,2$, and
$f_0=(f_1\wedge f_2)/(1-\varepsilon)$. $f_0$, $f_1'$, $f_2'$ are densities with respect to $\mu$. We write $P_0$, $P_1'$, $P_2'$ for the
associated probabilities. Then (1) holds with $\varepsilon_1=\varepsilon_2=\varepsilon\le\alpha$. Equivalence of (b) and (d)
follows from compactness of the sets of trimmings. The last claim follows easily from the
arguments above. □
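
Criterion (c) and the construction of the core in the proof can be checked numerically when both laws have densities. The following sketch assumes densities on the real line and a midpoint-rule quadrature; the helper names, the integration window and the number of steps are illustrative choices. The two distributions are the pair the paper uses for Figure 1.

```python
import math


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def integrate(g, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (k + 0.5) * h) for k in range(steps))


def tv_distance(f1, f2, lo, hi):
    """d_TV(P1, P2) = 1 - integral of min(f1, f2), for densities f1 and f2."""
    return 1.0 - integrate(lambda x: min(f1(x), f2(x)), lo, hi)


def f1(x):
    return normal_pdf(x)                                    # P = N(0, 1)


def f2(x):
    return 0.8 * normal_pdf(x) + 0.2 * normal_pdf(x, 4.0)   # Q = 0.8 N(0,1) + 0.2 N(4,1)


LO, HI = -10.0, 14.0   # assumed wide enough to hold essentially all the mass
eps = tv_distance(f1, f2, LO, HI)


def core_density(x):
    """f_0 = (f1 ^ f2) / (1 - eps): the core built in the proof of Proposition 2."""
    return min(f1(x), f2(x)) / (1.0 - eps)


def is_similar(alpha):
    """Criterion (c) of Proposition 2: alpha-similar iff d_TV <= alpha."""
    return eps <= alpha
```

Here `eps` comes out at about 0.19, so the pair is 0.2-similar but not 0.1-similar. Because the total variation distance is strictly below 0.2, Proposition 2 implies that the common core at level 0.2 is not unique.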

Remark 1. It follows from Proposition 2 that $W_2(\mathcal{R}_\alpha(P_1),\mathcal{R}_\alpha(P_2))>0$ if and only
if $d_{TV}(P_1,P_2)>\alpha$, that is, $d_{TV}(P_1,P_2)$ is the minimal level of trimming required to
make $P_1$ and $P_2$ equal. Also, if $d_{TV}(P_1,P_2)=\alpha$, then the probability $P_0$ with density
$f_0=(f_1\wedge f_2)/(1-\alpha)$ with respect to $\mu$ (as in the proof above) is the unique element
in $\mathcal{R}_\alpha(P_1)\cap\mathcal{R}_\alpha(P_2)$. This means that, as in Proposition 1, there is also a unique pair,
namely, $(P_0,P_0)\in\mathcal{R}_\alpha(P_1)\times\mathcal{R}_\alpha(P_2)$ such that

$$W_2(P_0,P_0)=W_2(\mathcal{R}_\alpha(P_1),\mathcal{R}_\alpha(P_2))=0.$$

This extends the result in Proposition 1 to the case $d_{TV}(P_1,P_2)\ge\alpha$.

Proposition 2 shows that the similarity model (1) can be expressed in terms of different
metrics. In fact, (d) would remain true if $W_2$ were replaced by any other metric for
which the sets of trimmings are compact. With applications in mind, $W_2$ turns out to
be a more convenient choice. In order to assess (1) from two samples of i.i.d. data with
empirical distributions $P_{1,n}$ and $P_{2,m}$, say, we will have $d_{TV}(P_{1,n},P_{2,m})=1$ almost surely
(provided $P_1$ and $P_2$ have densities) and we cannot use (at least in a naive fashion)
formulation (c). On the other hand, $W_2$ is well behaved in this respect and empirical
versions of both the minimal distances and the minimizers are consistent estimators of
their theoretical counterparts. This is the content of the following result (Theorem 2.17
in [2]). We quote it here for completeness.

Theorem 1 (Consistency). Let $\{X_n\}_n$, $\{Y_n\}_n$ be two sequences of i.i.d. random
variables with $\mathcal{L}(X_n)=P$, $\mathcal{L}(Y_n)=Q$, $P,Q\in\mathcal{F}_2$, and write $P_n$, $Q_m$ for the empirical
distributions based on the samples $X_1,\ldots,X_n$ and $Y_1,\ldots,Y_m$, respectively. Then, if
$\min(m,n)\to\infty$,

$$W_2(\mathcal{R}_\alpha(P_n),\mathcal{R}_\alpha(Q_m))\to W_2(\mathcal{R}_\alpha(P),\mathcal{R}_\alpha(Q))\quad\text{a.s.}$$

Further, if $P$ or $Q\ll\ell$ and $d_{TV}(P,Q)\ge\alpha$, then

$$W_2(P_{n,\alpha},P_\alpha)\to0\quad\text{and}\quad W_2(Q_{m,\alpha},Q_\alpha)\to0\quad\text{a.s.},$$

where $(P_\alpha,Q_\alpha)=\arg\min_{R_1\in\mathcal{R}_\alpha(P),\,R_2\in\mathcal{R}_\alpha(Q)}W_2(R_1,R_2)$ and $(P_{n,\alpha},Q_{m,\alpha})$ are defined
similarly from $P_n$, $Q_m$.

2.2. Related concepts and works 

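The similarity model (1) is obviously related to the so-called 'contamination neighborhoods'
of a probability $P_0$, defined as

$$\begin{aligned}\mathcal{V}_\varepsilon(P_0)&:=\{(1-\varepsilon)P_0+\varepsilon P':\ P'\in\mathcal{P}\}\\&\,=\{Q\in\mathcal{P}:\ Q(A)\le(1-\varepsilon)P_0(A)+\varepsilon\ \text{for every Borel set }A\},\end{aligned}\tag{5}$$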

which have been widely used in the theory of robust statistics after the pioneering works
by Huber [15, 16]. In particular, Huber [16] introduced these neighborhoods in robust
testing, providing a robust version of the Neyman-Pearson lemma for simple hypothesis
versus simple alternative. This theory was completed for more general sets of hypotheses
and alternatives, additionally considering more flexible neighborhoods in Huber and
Strassen [17], Rieder [22] and Buja [5]. In fact, Rieder's neighborhoods of a probability $P_0$,
defined as

$$\mathcal{V}_{\varepsilon,\delta}(P_0):=\{Q\in\mathcal{P}:\ Q(A)\le(1-\varepsilon)P_0(A)+\varepsilon+\delta\ \text{for every Borel set }A\},\tag{6}$$

comprise contamination as well as total variation norm neighborhoods (taking $\delta=0$ or
$\varepsilon=0$, resp.).

It can be easily shown (see also Proposition 2.1 in [2]) that $P\in\mathcal{V}_\varepsilon(P_0)$ is equivalent
to $P_0\in\mathcal{R}_\varepsilon(P)$. Thus, our statement "$P_1$ and $P_2$ are $\alpha$-similar" can also be expressed, in
terms of contamination neighborhoods, as "there exists a probability $P_0$ such that
$P_1,P_2\in\mathcal{V}_\alpha(P_0)$". However, there are different possibilities for such $P_0$, and the model considered in
this paper, given through any one of the equivalent statements in Proposition 2, cannot
be expressed in terms of a neighborhood, like (5) or (6), of a fixed probability.

Further related work includes Alvarez-Esteban et al. [1], where it is shown, for a probability,
$P$, on the real line, that $\mathcal{R}_\alpha(P)$ can be expressed in terms of the trimmings of
the uniform law on $(0,1)$, $U(0,1)$. This set can be identified with the set $\mathcal{C}_\alpha$ of absolutely
continuous functions $h:[0,1]\to[0,1]$ such that $h(0)=0$, $h(1)=1$, with derivative $h'$ such
that $0\le h'\le\frac{1}{1-\alpha}$. For such a function $h$, it is useful to write $P_h$ for the probability measure
with distribution function $h(P(-\infty,t])$. Then

$$\mathcal{R}_\alpha(P)=\{P_h:\ h\in\mathcal{C}_\alpha\}.\tag{7}$$

Hence, we can measure the deviation between the sets of trimmings of $P$ and $Q$ through

$$\mathcal{T}_\alpha(P,Q):=\min_{h\in\mathcal{C}_\alpha}W_2(P_h,Q_h).$$

We call $\mathcal{T}_\alpha(P,Q)$ the common trimming distance between $P$ and $Q$. If $P$ and $Q$ have
quantile functions $F^{-1}$ and $G^{-1}$, then a simple change of variable shows

$$W_2^2(P_h,Q_h)=\int_0^1\big(F^{-1}(h^{-1}(x))-G^{-1}(h^{-1}(x))\big)^2\,dx=\int_0^1\big(F^{-1}(y)-G^{-1}(y)\big)^2h'(y)\,dy.$$

Thus, $\mathcal{T}_\alpha(P,Q)=0$ if and only if $\ell(\{y\in(0,1):\ F^{-1}(y)\ne G^{-1}(y)\})\le\alpha$. It follows easily
from this that $\mathcal{T}_\alpha(P,Q)=0$ if and only if there is a random variable $Z$ defined on
a probability space $(\Omega,\mathcal{F},\mathbb{P})$ and non-decreasing, left-continuous functions, $\varphi_1$, $\varphi_2$, with
$\mathcal{L}(\varphi_1(Z))=P$, $\mathcal{L}(\varphi_2(Z))=Q$ such that

$$\mathbb{P}(\varphi_1(Z)\ne\varphi_2(Z))\le\alpha.\tag{8}$$

In contrast, since $d_{TV}(P,Q)=\min\{\mathbb{P}(X\ne Y):\ \mathcal{L}(X)=P,\ \mathcal{L}(Y)=Q\}$ (see Lemma 2.20
in [20]), we see that $W_2(\mathcal{R}_\alpha(P),\mathcal{R}_\alpha(Q))=0$ if and only if $\mathcal{L}(\varphi_1(Z))=P$, $\mathcal{L}(\varphi_2(Z))=Q$
for some random variable $Z$ and measurable (not necessarily monotonic) $\varphi_1$, $\varphi_2$ such that (8)
holds. In summary, two random objects are $\alpha$-similar if and only if they are different
transforms of a common random signal and the transforms differ from each other with
probability at most $\alpha$; they are equivalent in terms of common trimming if and only if
they are different monotonic transforms of a common random signal and the transforms
differ from each other with probability at most $\alpha$. In the somewhat artificial event that we
believe that our two samples come from a monotonic, possibly different, transform of some
original signal, then the common trimming similarity model is reasonable. Otherwise,
the similarity model (1) is the natural choice. For a less technical illustration of this
idea we show in Figure 1 the different effect of independent and common trimming. We
have taken $P=N(0,1)$, $Q=0.8N(0,1)+0.2N(4,1)$ and three values of the trimming
level, $\alpha$. In the first row we show the densities of $P_\alpha$ (blue line) and $Q_\alpha$ (red line), with
$(P_\alpha,Q_\alpha)=\arg\min_{R_1\in\mathcal{R}_\alpha(P),\,R_2\in\mathcal{R}_\alpha(Q)}W_2(R_1,R_2)$. In this case, trimming $\alpha=0.2$ results
in $P_\alpha=Q_\alpha$, that is, trimming removes contamination. The second row shows the densities
of $P_{h_\alpha}$ (blue line) and $Q_{h_\alpha}$ (red line), with $h_\alpha=\arg\min_{h\in\mathcal{C}_\alpha}W_2(P_h,Q_h)$. Clearly, $P_{h_\alpha}$
and $Q_{h_\alpha}$ are different and this remains true no matter how close to 1 we choose $\alpha$. If
trimming is used with the goal of removing contamination and assessing that the cores
of the two distributions are equal, then it is clear that the common trimming approach
fails to do so.

Figure 1. Densities of optimally trimmed $P$ and $Q$ with independent trimming (first row) and
common trimming (second row).


In Alvarez-Esteban et al. [3] we have considered, under this common trimming setup,
the problem of testing whether a random sample can be considered 'mostly normal',
that is, if the generator of the sample is similar to a normal distribution with unknown
parameters.

Finally let us mention the application in Alvarez-Esteban et al. [2] of some asymptotic
results for a related two-sample problem: Given $X_1,\ldots,X_n$ i.i.d. $P$ and $Y_1,\ldots,Y_m$ i.i.d. $Q$,
we consider testing the related null hypotheses

$$\begin{aligned}H_1:&\ W_2(\mathcal{R}_\alpha(P),\mathcal{R}_\alpha(Q))\le\Delta_0\quad\text{vs.}\quad W_2(\mathcal{R}_\alpha(P),\mathcal{R}_\alpha(Q))>\Delta_0,\\H_2:&\ W_2(\mathcal{R}_\alpha(P),\mathcal{R}_\alpha(Q))\ge\Delta_0\quad\text{vs.}\quad W_2(\mathcal{R}_\alpha(P),\mathcal{R}_\alpha(Q))<\Delta_0\end{aligned}$$

for a given threshold $\Delta_0>0$ to be chosen by the practitioner. Observe that rejecting
the null hypothesis $H_2$ allows us to conclude that, with high confidence, the unknown
random generators $P$ and $Q$ are not far from similarity.

2.3. The over-fitting effect of trimming

In this subsection we keep the notation of Theorem 1 and assume that we deal with
two independent samples, $X_1,\ldots,X_n$ i.i.d. $P$ and $Y_1,\ldots,Y_m$ i.i.d. $Q$. We write $P_n$, $Q_m$
for the empirical measures and $P_{n,\alpha}$, $Q_{m,\alpha}$ are minimizers of the $W_2$ distance between
trimmings of the empirical distributions $P_n$, $Q_m$.

It follows from Theorem 1 that $W_2(P_{n,\alpha},Q_{m,\alpha})\to0$ a.s. when the similarity model (1)
holds true and we may wonder about the rate of convergence in this limit. Note that
under homogeneity, that is, if $P=Q$ and taking $n=m$ for simplicity, we have under
integrability assumptions

$$\sqrt{n}\,W_2(P_n,Q_n)\to_w\left(2\int_0^1\frac{B^2(t)}{f^2(F^{-1}(t))}\,dt\right)^{1/2},\tag{9}$$

where $B$ is a Brownian bridge and $f$ and $F^{-1}$ are the density and quantile functions
of $P$ (this follows easily, for instance, from Theorem 4.6 in [10]). Thus, random samples
from homogeneous generators have empirical distributions at $W_2$-distance of exact
order $n^{-1/2}$, while, for non-homogeneous random generators $W_2(P_n,Q_n)\to W_2(P,Q)$,
a positive constant. Likewise, in the common trimming model of Section 2.2, if $h_{n,\alpha}$ is
such that $\mathcal{T}_\alpha(P_n,Q_n)=W_2((P_n)_{h_{n,\alpha}},(Q_n)_{h_{n,\alpha}})$ and we write $P_{n,\alpha}=(P_n)_{h_{n,\alpha}}$,
$Q_{n,\alpha}=(Q_n)_{h_{n,\alpha}}$ (the optimal trimmings of the empirical measures), then, under $\mathcal{T}_\alpha(P,Q)=0$,
we have that $\sqrt{n}\,W_2(P_{n,\alpha},Q_{n,\alpha})$ converges in law to a non-null limit (Theorem A.1 in [1]),
whereas if $\mathcal{T}_\alpha(P,Q)>0$, then $W_2(P_{n,\alpha},Q_{n,\alpha})$ converges a.s. to a positive constant.

In the similarity model (1) the gap between the null and the alternative is of higher
order. If $P$ and $Q$ are not similar at level $\alpha$, then $W_2(P_{n,\alpha},Q_{m,\alpha})\to W_2(P_\alpha,Q_\alpha)>0$
(Theorem 1). On the other hand, if $d_{TV}(P,Q)<\alpha$, then our next result shows that
$\sqrt{n}\,W_2(P_{n,\alpha},Q_{m,\alpha})\to0$ in probability. To avoid integrability issues, we assume $P$ and $Q$ to
have bounded support; this is enough for applications, since a monotonic transformation
of the data could achieve boundedness while preserving the distance in total variation.
Furthermore, it ensures that the conditions $d_{TV}(P,Q)\le\alpha$ and $W_2(\mathcal{R}_\alpha(P),\mathcal{R}_\alpha(Q))=0$
are equivalent.

Figure 2. Trajectories of the uniform empirical process (solid line) and two variants based on
trimming. The trimming levels are $\alpha=0.1$ and $\alpha=0.3$ (dashed and dotted lines).

Theorem 2. Assume $P$, $Q$ are supported in a common bounded interval and have
densities bounded away from zero and with bounded derivatives. Assume further that
$n/(n+m)\to\lambda\in(0,1)$. If $\alpha_n\in(0,1)$ satisfies $\alpha_n\ge d_{TV}(P,Q)+\frac{r_n}{\sqrt{n}}$ for some $r_n\to\infty$,
then

$$\sqrt{n}\,W_2(P_{n,\alpha_n},Q_{m,\alpha_n})\to0\quad\text{in probability.}\tag{10}$$

We give a proof of Theorem 2 in the Appendix. A similar over-fitting effect is observed
if a sample is over-trimmed to optimally fit a given model: If $X_1,\ldots,X_n$ are i.i.d. $P$,
$P_{n,\alpha}=\arg\min_{R\in\mathcal{R}_\alpha(P_n)}W_2(R,Q)$ and $W_2(\mathcal{R}_{\alpha_0}(P),Q)=0$ for some $\alpha_0<\alpha$, then (see
Theorem 5 in the Appendix)

$$\sqrt{n}\,W_2(P_{n,\alpha},Q)\to0\quad\text{in probability.}$$

Empirical evidence of this over-fitting effect is shown in Figure 2. A random sample of
size $n=1000$ from a $U(0,1)$ distribution was taken. This sample was trimmed using the
proportions $\alpha=0,0.1,0.3$ in order to obtain a sample as close to the $U(0,1)$ as possible.
We denote by $F_n^\alpha$ the distribution function of $P_{n,\alpha}$ and in Figure 2, we represent the
empirical processes $D_n^\alpha(t)=n^{1/2}(F_n^\alpha(t)-t)$, $t\in[0,1]$, for $\alpha=0,0.1,0.3$.

Since the true random generator and the target are the same, no trimming is required
in this case to remove contamination and, for $\alpha>0$, we are over-trimming. Observe
that $D_n^{0.1}$ and $D_n^{0.3}$ do not differ too much from each other, while they are quite far from
the untrimmed version.

3. A bootstrap assessment of similarity

We show in this section how we can use the over-fitting effect of trimming for the assessment
of the similarity model (1). Again, we will assume that we observe two independent
random samples $X_1,\ldots,X_n$ i.i.d. $P$, $Y_1,\ldots,Y_m$ i.i.d. $Q$. We would like to test the null
hypothesis $H_0$: $d_{TV}(P,Q)\le\alpha$. Theorem 2 says that trimming beyond the similarity
level kills randomness and results in (trimmed) samples that are more similar to each
other than random samples coming from the same generator. We will use a bootstrap
approach to generate suitable random samples from a common generator and compare
the optimally trimmed distance to the distance computed on the bootstrap replicates.
We write $P_n$, $Q_m$ for the empirical distributions and, given $\alpha_n\in(0,1)$,

$$(P_{n,\alpha_n},Q_{m,\alpha_n})=\arg\min_{R_1\in\mathcal{R}_{\alpha_n}(P_n),\,R_2\in\mathcal{R}_{\alpha_n}(Q_m)}W_2(R_1,R_2),$$

so that $W_2(P_{n,\alpha_n},Q_{m,\alpha_n})=W_2(\mathcal{R}_{\alpha_n}(P_n),\mathcal{R}_{\alpha_n}(Q_m))$.

We consider now the pooled probability

$$R_{n,m}:=\frac{n}{n+m}P_{n,\alpha_n}+\frac{m}{n+m}Q_{m,\alpha_n}.$$

$R_{n,m}$ is a random probability measure concentrated on $\{Z_1,\ldots,Z_{n+m}\}$, where $Z_j=X_j$
for $j=1,\ldots,n$, and $Z_j=Y_{j-n}$ for $j=n+1,\ldots,n+m$.

Conditionally, given the data, we draw new random variables, $X_1^*,\ldots,X_{n'}^*,Y_1^*,\ldots,Y_{m'}^*$,
i.i.d. $R_{n,m}$, with $m'=[n'm/n]$ and $n'$ to be chosen later. We will use the notation $\mathbb{P}^*$
for the bootstrap probability, that is, the conditional probability given the original data
$\{X_n\}_n$, $\{Y_m\}_m$. Finally, by $P^*_{n'}$ and $Q^*_{m'}$, we will denote the empirical measures based on
$X_1^*,\ldots,X_{n'}^*$ and $Y_1^*,\ldots,Y_{m'}^*$, respectively. Now, we define

$$p^*_{n,m}:=\mathbb{P}^*\left(\sqrt{\frac{n'm'}{n'+m'}}\,W_2(P^*_{n'},Q^*_{m'})>\sqrt{\frac{nm}{n+m}}\,W_2(P_{n,\alpha_n},Q_{m,\alpha_n})\right).\tag{11}$$

$p^*_{n,m}$ is the bootstrap p-value for the similarity model (1), with rejection for small values
of it. In practice $p^*_{n,m}$ can be approximated by Monte Carlo simulation. We note that
if $n\alpha_n$ and $m\alpha_n$ are integers, typically the trimming process will not produce partially
trimmed points and $P_{n,\alpha_n}$ and $Q_{m,\alpha_n}$ will be the empirical measures on the sets of
non-trimmed data. If we take $\alpha_n\to\alpha$, then if the similarity model fails, $W_2(P_{n,\alpha_n},Q_{m,\alpha_n})$
will be large while $W_2(P^*_{n'},Q^*_{m'})$ will vanish. On the other hand, for similar distributions
$W_2(P_{n,\alpha_n},Q_{m,\alpha_n})$ will vanish at a faster rate than $W_2(P^*_{n'},Q^*_{m'})$ and rejection for small
bootstrap p-values will result in a consistent rule. We make this precise in our next result.
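
The Monte Carlo approximation of the bootstrap p-value in (11) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the optimal trimming has already been computed elsewhere, so the trimmed weights and the trimmed distance are inputs. The function names, the number of replicates `n_boot` and the seed are hypothetical choices.

```python
import math
import random


def w2_empirical(x, y):
    """W2 between the empirical measures of samples x and y (sizes may differ),
    as the L2 distance between their step quantile functions."""
    x, y = sorted(x), sorted(y)
    n, m = len(x), len(y)
    i = j = 0
    t = 0.0          # current quantile level
    total = 0.0
    while i < n and j < m:
        nxt = min((i + 1) / n, (j + 1) / m)   # next jump of either quantile function
        total += (nxt - t) * (x[i] - y[j]) ** 2
        t = nxt
        if (i + 1) / n <= t:
            i += 1
        if (j + 1) / m <= t:
            j += 1
    return math.sqrt(total)


def pooled_weights(p_weights, q_weights):
    """Weights of R_{n,m} = n/(n+m) P_{n,alpha_n} + m/(n+m) Q_{m,alpha_n}
    on the pooled sample (X's first, then Y's)."""
    n, m = len(p_weights), len(q_weights)
    return ([n / (n + m) * w for w in p_weights]
            + [m / (n + m) * w for w in q_weights])


def bootstrap_p_value(z, r_weights, n, m, trimmed_w2, n_prime, n_boot=500, seed=0):
    """Monte Carlo approximation of the bootstrap p-value (11).

    z          : pooled sample Z_1, ..., Z_{n+m}
    r_weights  : weights of R_{n,m} on z (assumed computed beforehand)
    trimmed_w2 : W2(P_{n,alpha_n}, Q_{m,alpha_n}) (assumed computed beforehand)
    n_prime    : resampling size n'; m' is set to the integer part of n' m / n
    """
    rng = random.Random(seed)
    m_prime = (n_prime * m) // n
    threshold = math.sqrt(n * m / (n + m)) * trimmed_w2
    scale = math.sqrt(n_prime * m_prime / (n_prime + m_prime))
    exceed = 0
    for _ in range(n_boot):
        draw = rng.choices(z, weights=r_weights, k=n_prime + m_prime)
        stat = scale * w2_empirical(draw[:n_prime], draw[n_prime:])
        if stat > threshold:
            exceed += 1
    return exceed / n_boot
```

Small returned values speak against the similarity model. The resampling size `n_prime` is left to the caller, since the text only fixes it later through asymptotic conditions.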

Theorem 3. With the above notation, assume that $P$, $Q$ have densities satisfying the
assumptions of Theorem 2. Assume further that $n/(n+m)\to\lambda\in(0,1)$ and take
$\alpha_n=\alpha+K/\sqrt{n\wedge m}$ with $K>0$. Then, if $n'\to\infty$ and $n'=O(n)$,

(i) if $d_{TV}(P,Q)<\alpha$, then $p^*_{n,m}\to1$ in probability,

(ii) if $d_{TV}(P,Q)>\alpha$, then $p^*_{n,m}\to0$ in probability.
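
The tuning quantities in Theorem 3 are simple to compute once the constant $K$ and the resampling size $n'$ have been chosen. The helper names below are hypothetical, and the theorem only constrains $K>0$ and the growth of $n'$, so the actual values remain the user's choice.

```python
import math


def trimming_level(alpha, K, n, m):
    """alpha_n = alpha + K / sqrt(min(n, m)), the trimming level of Theorem 3."""
    return alpha + K / math.sqrt(min(n, m))


def resample_sizes(n, m, n_prime):
    """Bootstrap sample sizes (n', m') with m' the integer part of n' m / n."""
    return n_prime, (n_prime * m) // n
```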

A proof of Theorem 3 is given in the Appendix. It roughly says that a test of the
similarity model (1) that rejects $\alpha$-similarity for values of $p^*_{n,m}$ below a fixed threshold
$L\in(0,1)$ is a consistent rule. In order to make a sensible choice of the threshold, $L$,
as well as of the constant, $K$, in Theorem 3, we still need to control the probability of
rejection at the boundary of the null hypothesis; that is, in the case $d_{TV}(P,Q)=\alpha$. In this
case we write again $P_0$ for the common part of $P$ and $Q$ in the canonical decomposition
in Remark 1. If $P_n\in\mathcal{R}_{\alpha_n}(P)$ and $Q_n\in\mathcal{R}_{\alpha_n}(Q)$, with $\alpha_n$ as in Theorem 3, are such
that $W_2(P_n,Q_n)\to0$, then, by uniqueness, we have $W_2(P_n,P_0)\to0$. We introduce the
following assumption about rates in this convergence: If $P_n\in\mathcal{R}_{\alpha_n}(P)$, $Q_n\in\mathcal{R}_{\alpha_n}(Q)$
(and $\alpha_n=d_{TV}(P,Q)+\frac{K}{\sqrt{n}}$), then, for some $\rho\in(0,1]$,

$$W_2(P_n,Q_n)=O(n^{-1/2})\ \Rightarrow\ W_2(P_n,P_0)=O(n^{-\rho/2}).\tag{12}$$

Under this assumption we can control the type I error probability using our next result. 

Theorem 4. Under the assumptions and notation of Theorem 3, if $P$ and $Q$ are such
that $d_{TV}(P,Q)=\alpha$ and satisfy (12), taking $n'\to\infty$, $n'=o(n^\rho)$ and



\ ail — a] , , , . 

Vn A m 

with $\gamma\in(0,1)$, then $\limsup_n\mathbb{P}(p^*_{n,m}\le\beta)\le\beta+\gamma$.

The main consequence is that we can test the similarity model (1) at a given level
$\beta+\gamma\in(0,1)$. To be precise, if we replace our ideal $H_0$: $d_{TV}(P,Q)\le\alpha$ by $\tilde{H}_0$ consisting of
pairs $(P,Q)$ satisfying the assumptions in Theorem 3 and $d_{TV}(P,Q)<\alpha$ or $d_{TV}(P,Q)=\alpha$
plus Condition (12), then, if we reject for $p^*_{n,m}\le\beta$, Theorems 3 and 4 ensure

$$\sup_{(P,Q)\in\tilde{H}_0}\limsup_n\mathbb{P}_{(P,Q)}(p^*_{n,m}\le\beta)\le\beta+\gamma,$$

where $\mathbb{P}_{(P,Q)}$ denotes probability assuming the laws of the $X$'s and the $Y$'s are $P$ and $Q$,
respectively. It is in this sense that we can say that the procedure is conservative, having
an asymptotic level of, at most, $\beta+\gamma$; nevertheless, the test will consistently reject the
similarity model if it fails. In the next section we show the performance in practice of
this procedure. Of course, one would like to control

$$\limsup_n\sup_{(P,Q)\in\tilde{H}_0}\mathbb{P}_{(P,Q)}(p^*_{n,m}\le\beta)$$

instead of the bound given by our results. Some of the limitations of our procedure come
from the smoothness requirements posed by our choice of metric, $W_2$. This could, perhaps,
be overcome with the use of the $L_1$ Wasserstein metric (but we would lose the
uniqueness and consistency results given in Proposition 1 and Theorem 1) and consideration
of a less restrictive null hypothesis, $\tilde{H}_0$. Uniformity in $(P,Q)\in\tilde{H}_0$ is a more delicate
issue, since one can take $P$ and $Q$ at an arbitrary (but positive) Wasserstein distance
from each other, but such that they are at distance one in total variation. Perhaps a different
choice of metric could lead to some type of uniform bound. We believe this issue
is worth further research.

Turning to the meaning of Condition (12), observe that the contaminations, $P_1'$, $P_2'$, in
the canonical decomposition in Proposition 2 have disjoint support but can be arbitrarily
close in Wasserstein distance. With Condition (12) we avoid pathological cases in which
some inconvenient distribution of the contaminations allows that some trimmings of $P$
and $Q$, with trimming size slightly above the similarity level, are close to each other
without being too close to the common core. Rather than pursuing an involved technical
analysis we include a couple of illustrative examples that show that the best possible
rate $\rho$ depends on the degree of separation between the contaminating distributions $P_1'$, $P_2'$
in the canonical decomposition. In the well-separated case (when the distance between
the supports of $P_1'$ and $P_2'$ is positive), under additional technical conditions we can
take $\rho=1$ and we have that the optimal trimming, $P_{n,\alpha_n}$, approaches the common
part, $P_0$, at the parametric rate: $W_2(P_{n,\alpha_n},P_0)=O_P(n^{-1/2})$. Without this separation
we cannot take $\rho$ greater than 4/5 and we have a nonparametric rate of convergence:
$W_2(P_{n,\alpha_n},P_0)=O_P(n^{-2/5})$. Again, in our examples we assume $P$ and $Q$ to have bounded
support since this is enough for applications.

Example 1 (The well-separated case). Assume P and Q are probabilities on the real 
line with quantile functions, $F^{-1}$ and $G^{-1}$, such that $G^{-1}(t) = F^{-1}(t + \alpha)$, $0 < t < 1 - \alpha$, 
and $F^{-1}$ has a bounded derivative (as in Figure 3(a)). Then $d_{TV}(P,Q) = \alpha$ and, taking 
$\alpha_n = \alpha + K/\sqrt{n}$ for some $K > 0$ and writing $P_0$ for the common part in the canonical 
decomposition for P and Q, we have that if $P_n \in \mathcal{R}_{\alpha_n}(P)$, $Q_n \in \mathcal{R}_{\alpha_n}(Q)$, then 

$$W_2(P_n, Q_n) = O(n^{-1/2}) \;\Rightarrow\; W_2(P_n, P_0) = O(n^{-1/2}).$$

Example 2 (The non-separated case). We assume now that P and Q differ only 
in location and have a symmetric, unimodal density. Without loss of generality, we 
write $F(\cdot + \mu/2)$ and $F(\cdot - \mu/2)$ for the distribution functions of P and Q, respectively, 
and f for the density associated to F. We suppose that F has bounded support and f 
is strictly positive on it. Further, we assume f to be continuously differentiable with 
$f' < 0$ in $(0, \sup(\mathrm{supp}(F)))$. If $\mu$ and $\alpha$ satisfy $1 - \alpha = 2(1 - F(\mu/2)) = 2F(-\mu/2)$, then 
$d_{TV}(P,Q) = \alpha$ (see Figure 3(b)). If $P_n \in \mathcal{R}_{\alpha_n}(P)$, $Q_n \in \mathcal{R}_{\alpha_n}(Q)$, then 

$$W_2(P_n, Q_n) = O(n^{-1/2}) \;\Rightarrow\; W_2(P_n, P_0) = O(n^{-2/5}).$$

A proof of the claims in the last two examples is sketched in the Appendix. 

While this work is concerned mainly with testing $\alpha$-similarity in two-sample problems, 
in many real problems the interest could be focused on the estimation of the common 
core $P_0$. The results in Section 2 ensure that the pooled probability, $R_{n,m}$, in our bootstrap 
procedure is a consistent estimator of $P_0$ if $\alpha$ equals the (unknown) distance in total 
variation between P and Q. Our simulations in Section 4 (see Figure 5 and the related 
comments) suggest that the bootstrap p-value curves (the values of $p^*_{n,m}$ as a function 
of $\alpha$) change sharply from 0 to 1 around the true similarity level. This rapid 
growth could perhaps be used to estimate the similarity level and, as a result, 
the common core. Further research is needed. 

We conclude this section by presenting a simple upper bound for the transportation 
cost between empirical measures. This result, together with Theorem 2, is the key in our 
proofs of Theorems 3 and 4 and has some independent interest. The proof is also included 
in the Appendix. Here $X_{1,1},\dots,X_{1,n}; X_{2,1},\dots,X_{2,m}$ are i.i.d. $\mathbb{R}^k$-valued random vectors 
with common distribution P and $Y_{1,1},\dots,Y_{1,n}; Y_{2,1},\dots,Y_{2,m}$ are i.i.d. Q. We write $P_{n,1}$ 
and $P_{m,2}$ for the empirical measures based on $X_{1,1},\dots,X_{1,n}$ and $X_{2,1},\dots,X_{2,m}$, respectively, 
and, similarly, $Q_{n,1}$ and $Q_{m,2}$ for the empirical measures based on the $Y_{i,j}$. Let us 
define 

$$S_{n,m} := W_p(P_{n,1}, P_{m,2}) \quad\text{and}\quad T_{n,m} := W_p(Q_{n,1}, Q_{m,2}).$$

Proposition 3. With the above notation, if $p \ge 1$, then 

$$W_p(\mathcal{L}(S_{n,m}), \mathcal{L}(T_{n,m})) \le 2 W_p(P,Q).$$
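
The mechanism behind a bound of this type is the triangle inequality for $W_p$ applied to coupled samples. The following is only a minimal sketch on the real line, under assumptions that are not part of the paper: equal sample sizes (so that $W_p$ between empirical measures reduces to matching order statistics), a normal P, and a hypothetical location-scale coupling Y = shift + scale·X, which is monotone and therefore optimal in one dimension. It checks the pathwise inequality $|S_{n,m} - T_{n,m}| \le W_p(P_{n,1},Q_{n,1}) + W_p(P_{m,2},Q_{m,2})$; the function names are illustrative.

```python
import random


def wasserstein_1d(xs, ys, p=2):
    """W_p between two empirical measures on the line with equally many atoms."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal matching pairs the order statistics.
    pairs = zip(sorted(xs), sorted(ys))
    return (sum(abs(x - y) ** p for x, y in pairs) / len(xs)) ** (1.0 / p)


def coupled_samples(n, shift, scale, rng):
    """Draw X ~ N(0,1) and Y = shift + scale * X (hypothetical optimal coupling)."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [shift + scale * x for x in xs]
    return xs, ys


rng = random.Random(0)
n = 500
x1, y1 = coupled_samples(n, 0.5, 1.5, rng)  # first pair of coupled samples
x2, y2 = coupled_samples(n, 0.5, 1.5, rng)  # second, independent pair

s = wasserstein_1d(x1, x2)  # plays the role of S_{n,m}
t = wasserstein_1d(y1, y2)  # plays the role of T_{n,m}
gap_bound = wasserstein_1d(x1, y1) + wasserstein_1d(x2, y2)

print(f"|S - T| = {abs(s - t):.4f} <= {gap_bound:.4f}")
```

Each term of `gap_bound` is the empirical transport cost of one coupled sample, which is what ties the pathwise inequality to $W_p(P,Q)$ once expectations are taken.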

4. Empirical analysis of the procedure 

In this section we explore the performance of the procedure for finite samples. The section 
is divided into two subsections that address the analysis of a planned simulation study and 
of a case study, respectively. To simplify our exposition we will assume equal sizes in the 
two samples throughout the first subsection. All the computations have been carried out 
with the programs available at http://www.eio.uva.es/~pedroc/R. 

4.1. A simulation study 

Figure 4. Histograms, for different sizes of trimming (trimming levels 9% and 11%), of the 
bootstrap p-values obtained from 200 pairs of samples from P = N(0,1) and 
Q = 0.9N(0,1) + 0.1N(10,3) distributions. 

We consider first an example that illustrates the over-fitting effect on the bootstrap 
p-values. We generate 200 pairs of samples of size n = 1000 obtained from the N(0,1) and 
the 0.9N(0,1) + 0.1N(10,3) distributions. Then, for each pair of samples, we carry out the 
bootstrap procedure (1000 bootstrap replicates in each run) for trimming levels $\alpha = 0.09$ 
and 0.11. At this point an important caution when dealing with mixtures should be 
made, namely the distinction between the level (0.1 in our case) of the "contaminating" 
distribution in the mixture and the similarity level between the non-contaminated and 
contaminated distributions. Of course, both distributions are similar at level 0.1, but they 
are also similar at a lower level (recall the canonical decomposition in Remark 1). For 
example, since the supports of the U(0,1) and U(1,2) distributions are disjoint, the 
minimum level of similarity between the U(0,1) and 0.9U(0,1) + 0.1U(1,2) distributions 
is 0.1; but between the N(0,1) and 0.9N(0,1) + 0.1N($\mu$,3) distributions, it is strictly 
lower for every $\mu$. For instance, this level is 0.0484 if $\mu = 0$, 0.0653 for $\mu = 3$, or 0.0989 
when $\mu = 10$. 
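
The similarity levels quoted for the normal mixtures can be checked numerically, since the minimum similarity level is the total variation distance, $d_{TV}(P,Q) = \int (p - q)_+$. The sketch below is not the authors' code; it uses a plain midpoint rule and assumes that the second parameter of N($\mu$,3) is a standard deviation, an interpretation that is consistent with the value 0.0484 reported for $\mu = 0$. The function names, grid limits and step count are illustrative choices.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def similarity_level(mu, eps=0.1, sd=3.0, lo=-40.0, hi=60.0, steps=100_000):
    """Midpoint-rule approximation of d_TV(P, Q), the integral of (p - q)_+.

    P = N(0,1) and Q = (1 - eps) N(0,1) + eps N(mu, sd), with sd read as a
    standard deviation (an assumption, see the lead-in).
    """
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        p = normal_pdf(x, 0.0, 1.0)
        q = (1.0 - eps) * p + eps * normal_pdf(x, mu, sd)
        total += max(p - q, 0.0)
    return total * width


levels = {mu: similarity_level(mu) for mu in (0.0, 3.0, 10.0)}
for mu, level in levels.items():
    print(f"mu = {mu:4.1f}: similarity level ~ {level:.4f}")
```

Under this reading the three values should come out close to the levels quoted in the text for $\mu = 0$, 3 and 10.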

Figure 4 shows the absolute frequencies of the bootstrap p-values, $p^*_{n,n}$, obtained in 
this example. 

As stated above, the similarity level between the considered distributions is 0.0989. 
Thus, the probability of obtaining an observation from the non-common part in the mix- 
ture is 0.0989. Taking into account sample sizes and the number of samples considered, 
the expected number of times in which we obtain at most 110 'contaminating' observa- 
tions in both samples is 158.13. In these cases, after 0.11 trimming, we will be comparing 
similar samples and should have no evidence against similarity. We note that 158 is 
slightly below the observed frequency in the right bar of the right histogram in Figure 4. 
On the other hand, the expected number of times in which the amount of 'contaminat- 
ing' data exceeds 90 in both samples is 132.02. In this event, 0.09 trimming is unable 
to remove contamination and we should have strong evidence against similarity. We can 
check that 132 is close to the observed frequency in the left bar of the left histogram in 
Figure 4. 

[Figure 5. Panels: N(0,1) vs 0.9N(0,1) + 0.1N(10,3), for n = 100 and n = 300.] 

The comments above suggest that the p-values are very sensitive to the effective 
proportion of contamination in the data. This is further illustrated with the plots in Figure 5, 
which show the curves of bootstrap p-values conditioned to different ranges of 
contaminating proportion in the second sample (the amount of data coming from the N(10,3) 
distribution). In this figure we observe that the transition from p-values close to 0 to 
p-values close to 1 is very fast along the trimming level. In other words, the effect of 
under-/over-trimming becomes apparent very quickly. 
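
The expected counts quoted in the discussion of Figure 4 (158.13 and 132.02) can be approximately reproduced with a short binomial calculation. This sketch assumes, as the text suggests, that the number of 'contaminating' observations in each sample is Binomial(1000, 0.0989) and that the two samples are independent; the variable names are illustrative.

```python
import math


def binom_cdf(k, n, p):
    """P(N <= k) for N ~ Binomial(n, p), with terms computed in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for j in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * log_p + (n - j) * log_q
        )
        total += math.exp(log_term)
    return total


n, level, pairs = 1000, 0.0989, 200

# At most 110 contaminating points in both samples: 0.11 trimming can remove them all.
p_at_most_110 = binom_cdf(110, n, level)
expected_clean_after_11 = pairs * p_at_most_110 ** 2

# More than 90 contaminating points in both samples: 0.09 trimming cannot remove them all.
p_more_than_90 = 1.0 - binom_cdf(90, n, level)
expected_dirty_after_9 = pairs * p_more_than_90 ** 2

print(f"expected pairs, <= 110 in both samples: {expected_clean_after_11:.2f}")
print(f"expected pairs, > 90 in both samples:  {expected_dirty_after_9:.2f}")
```

Small discrepancies with the published figures may come from rounding of the similarity level 0.0989.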

We show next a simulation study to illustrate the power performance for finite samples 
of the bootstrap procedure introduced in Section 3, when the trimming level, $\alpha_n$, is 
determined as in Theorem 4. We consider two different cases, comparing samples of the 
same size, n, of P = N(0,1) versus $Q_i$, i = 1, 2. In the first case, $Q_1 = (1-\varepsilon)N(0,1) + 
\varepsilon N(10,1)$; the contamination is due to outliers. In the second case, the contamination is 
due to inliers and $Q_2 = (1-\varepsilon)N(0,1) + \varepsilon N(0,3)$. In both cases, the null hypothesis is $H_0$: 
$d_{TV}(P,Q_i) \le 0.1$ and we use 1000 bootstrap pairs of samples to obtain $p^*_{n,n}$, rejecting $H_0$ 
if $p^*_{n,n} \le 0.05 = \beta$. Then we compute the rejection frequencies in 1000 iterations of the 
procedure, obtaining the values shown in Tables 1 and 2. We do this for different values 
of $\varepsilon$ (hence different values of $\nu = d_{TV}(P,Q_i)$) and different resampling orders $n' = n^{\rho}$. 
The simulation shows that the bound given in Theorem 4 is approached for moderate 
sizes in the first case (see Table 1, $\nu = 0.10$). However, in the second case, the procedure 
is conservative. The main conclusion is that in both cases the contamination is detected, 
but detection is more difficult in the case in which the contamination comes from inliers. 

We close this subsection with a comparison to classical testing procedures that could 
be adapted to the setup of similarity testing. We recall from Proposition 2 that testing 
$\alpha$-similarity of P and Q is equivalent to testing whether $\sup_A |P(A) - Q(A)| \le \alpha$, 
with A ranging among all (measurable) sets. If we focus on sets of type $A = (-\infty, x]$, then 
we could test the null hypothesis $H_0$: $\sup_{x\in\mathbb{R}} |F(x) - G(x)| \le \alpha$ using the 
Kolmogorov-Smirnov statistic $D_n = \sup_{x\in\mathbb{R}} |F_n(x) - G_n(x)|$, where $F_n$ and $G_n$ denote the 
empirical distribution functions (d.f.'s) based on the $X_i$ and the $Y_j$, respectively (and we 
have assumed for simplicity samples of equal size). It is known (see [21]) that, provided 
$\sup_{x\in\mathbb{R}} |F(x) - G(x)| = \lambda > 0$, $\sqrt{n}(D_n - \lambda)$ converges weakly to $Z_\lambda(F,G) = \max(Z_1, Z_2)$ 
with 

$$Z_1 = \sup_{\{x:\,F(x)-G(x)=\lambda\}} B_1(G(x)+\lambda) - B_2(G(x)),$$

$$Z_2 = \sup_{\{x:\,G(x)-F(x)=\lambda\}} B_2(G(x)) - B_1(G(x)-\lambda),$$

where $B_1, B_2$ are independent Brownian bridges on (0,1). With standard arguments it 
can be shown that $P(Z_\lambda(F,G) > t) \le P(Z_\lambda > t)$ for $t > 0$, with 
$Z_\lambda = \sup_{0<x<1-\lambda} B_1(x+\lambda) - B_2(x)$. Hence, if we choose $z_\beta^{(\alpha)}$ such that 
$P(Z_\alpha > z_\beta^{(\alpha)}) = \beta$, then the test that rejects when 

$$D_n > \alpha + \frac{1}{\sqrt{n}}\, z_\beta^{(\alpha)}$$

is asymptotically of level $\beta$ for testing $H_0$: $\sup_{x\in\mathbb{R}} |F(x) - G(x)| \le \alpha$. The critical 
value $z_\beta^{(\alpha)}$ can be approximated by Monte Carlo simulation. We could try to use this 
procedure for testing the $\alpha$-similarity model. However, since we can find distributions 
that are arbitrarily close in Kolmogorov-Smirnov distance but far from each other in 
total variation distance, this alternative procedure can fail badly. We show this in our 
last simulation study (see Table 3). We have taken P = N(0,1) and Q = 0.70N(0,1) + 
0.15N(2.35,1) + 0.15N(-2.35,1), a mixture with three normal components. Here we have 
$\sup_{x\in\mathbb{R}} |P(-\infty,x] - Q(-\infty,x]| = 0.1$ and $d_{TV}(P,Q) = 0.2$ and we test $H_0$: $d_{TV}(P,Q) \le 
0.1$ at level 0.05. We show the observed frequencies of rejection for $D_n$ and our bootstrap 
procedure based on $W_2$ as in Theorem 4 with $\rho = 4/5$, $\gamma = 0.01$. In this case we reject for 
bootstrap p-values smaller than 0.04 to make the asymptotic probability of type I error 
less than 0.05. We have considered sample sizes n = 100, 300, 500 and 1000 and have 
produced 10,000 replicates of the tests in each case. We see that the Kolmogorov-Smirnov 
test fails to detect the dissimilarity, even for large sample sizes, while the bootstrap 
procedure suggested in this paper works reasonably for moderate sizes. 

Table 1. Observed rejection frequencies for $H_0$: $d_{TV}(P,Q_1) \le 0.1$, P = N(0,1), 
$Q_1 = (1-\varepsilon)N(0,1) + \varepsilon N(10,1)$, where $\nu = d_{TV}(P,Q_1)$ and $\beta = 0.05$ 

                            ρ = 1            ρ = 4/5          ρ = 2/3          ρ = 1/2
ν (ε)              n     γ=0.05  γ=0.01   γ=0.05  γ=0.01   γ=0.05  γ=0.01   γ=0.05  γ=0.01
0.10 (ε≈0.10)     100    0.008   0.001    0.016   0.003    0.043   0.006    0.047   0.007
                  300    0.030   0.007    0.040   0.015    0.059   0.017    0.065   0.019
                 1000    0.052   0.009    0.092   0.016    0.098   0.018    0.114   0.022
0.15 (ε≈0.15)     100    0.130   0.044    0.207   0.090    0.246   0.130    0.252   0.170
                  300    0.587   0.386    0.648   0.458    0.687   0.507    0.703   0.556
                 1000    0.996   0.980    0.998   0.985    0.998   0.986    0.999   0.990
0.20 (ε≈0.20)     100    0.576   0.403    …       0.515    0.732   0.585    0.738   0.624
                  300    … 0.993 0.985 0.993 0.986 …
                 1000    … 1
0.25 (ε≈0.25)     100    0.919 … 0.953 … 0.929
            300, 1000    … 1 … 1 1 1 … 1 1 … 1 1 1 1 1 1 1 1

Table 2. Observed rejection frequencies for $H_0$: $d_{TV}(P,Q_2) \le 0.1$, P = N(0,1), 
$Q_2 = (1-\varepsilon)N(0,1) + \varepsilon N(0,3)$, where $\nu = d_{TV}(P,Q_2)$ and $\beta = 0.05$ 

… 0.146 0.277 0.163 0.301 0.189 0.324 …

Table 3. Observed rejection frequencies for $H_0$: $d_{TV}(P,Q) \le 0.1$, P = N(0,1), 
Q = 0.70N(0,1) + 0.15N(2.35,1) + 0.15N(-2.35,1), at level 0.05 

n         100      300      500      1000
D_n       0.007    0.004    0.003    0.002
W_2       0.007    0.091    0.320    0.875
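
The two distances quoted for this example (0.1 in Kolmogorov-Smirnov distance, 0.2 in total variation) can be checked with a crude grid computation. The sketch below is only a numerical illustration, not part of the simulation study; the grid limits, step count and function names are arbitrary choices.

```python
import math

# Mixture Q from the text as (weight, mean) pairs; all components have unit variance.
COMPONENTS = ((0.70, 0.0), (0.15, 2.35), (0.15, -2.35))


def norm_pdf(x, mean=0.0):
    """Density of N(mean, 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def norm_cdf(x, mean=0.0):
    """Distribution function of N(mean, 1)."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))


def q_pdf(x):
    return sum(w * norm_pdf(x, m) for w, m in COMPONENTS)


def q_cdf(x):
    return sum(w * norm_cdf(x, m) for w, m in COMPONENTS)


lo, hi, steps = -12.0, 12.0, 48_000
width = (hi - lo) / steps
grid = [lo + (i + 0.5) * width for i in range(steps)]

# Kolmogorov-Smirnov distance: largest gap between the distribution functions.
ks_distance = max(abs(norm_cdf(x) - q_cdf(x)) for x in grid)
# Total variation distance: integral of the positive part of p - q.
tv_distance = width * sum(max(norm_pdf(x) - q_pdf(x), 0.0) for x in grid)

print(f"KS distance ~ {ks_distance:.3f}, TV distance ~ {tv_distance:.3f}")
```

Because Q is symmetric about 0, the distribution functions cross at the origin and the total variation distance is twice the Kolmogorov-Smirnov distance, which is why a test calibrated on $D_n$ sees a pair that sits on the boundary of its own null hypothesis.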

4.2. A case study 

The data from this case study come from an admission exam to the Universidad de 
Valladolid. 308 exams on the same subject were randomly assigned to 2 markers. The 
distribution of the exams was not exactly balanced and markers received 152 and 156 
exams, respectively. Each exam was given a grade between 0 and 10 points. In the admission 
exams some marking criteria are given to the markers with the goal of making the 
grading process "homogeneous". The main goal of this study is to determine whether the 
markers are using the same common criteria. Some degree of deviation from this common 
pattern is allowed for each marker. Therefore, we would like to assess the similarity of 
the samples of marks for the different markers. 

Figure 6. Best trimmings between markers 1 and 2, in the example of Section 4.2, $\alpha = 0.05$ 
(white), $\alpha = 0.10$ (white + yellow) and $\alpha = 0.15$ (white + yellow + orange). 

Table 4. Bootstrap p-values arising from the introduced bootstrap methodology, applied to the 
similarity analysis between markers ($\beta = 0.05$) 

Nonparametric tests strongly reject, at level 0.05, homogeneity between 
the considered marking distributions (Wilcoxon-Mann-Whitney, p-value = 0.000; 
and Kolmogorov-Smirnov, p-value = 0.003). In Figure 6 we show the histograms 
corresponding to the full data sets and the progressive effects of best trimming, minimizing the 
Wasserstein distance between the remaining subsample distributions. The white portions 
of the bars represent the trimmed observations when the trimming size is $\alpha = 0.05$, the 
union of the white and yellow portions are the trimmed observations when $\alpha = 0.1$ and 
the orange portions complete the trimming corresponding to $\alpha = 0.15$. Notice that the 
best trimming is far from being symmetric. 

In Table 4 we have included the p-values corresponding to the bootstrap procedure 
introduced in Section 3. In every case, for fixed $\beta = 0.05$ and taking $\alpha_n$ as in Theorem 4, 
we used 1000 bootstrap samples to compute the p-values for the null hypothesis $H_0$: 
$d_{TV}(P,Q) \le \alpha$. In general terms, these p-values show that both samples are not 
0.05-similar, but they can be considered 0.10-similar. The considerations made in Section 3 
about Condition (12) show the convenience of using resampling orders less than or equal 
to $n^{4/5}$, as we do not know if the supports of the contaminating distributions are well 
separated or not. 

Appendix 

A.1. Proof of Theorem 2 

Our proof is based on a parallel result for the one-sample case. Let $P_n$ be the empirical 
measure based on i.i.d. random variables $X_1,\dots,X_n$ with common distribution P. In the 
particular case P = Q and $\alpha = 0$ we have $nW_2^2(P_n, Q) = O_P(1)$ under sufficient 
integrability assumptions (see [10]). From the obvious bound $W_2(\mathcal{R}_\alpha(P_n), Q) \le W_2(P_n, Q)$ we 
see that $nW_2^2(\mathcal{R}_\alpha(P_n), Q) = O_P(1)$. Our first result here shows that $nW_2^2(\mathcal{R}_\alpha(P_n), Q) = 
o_P(1)$ even if $P \ne Q$. 

Theorem 5. Assume that $Q \in \mathcal{R}_{\alpha_0}(P)$ for some $\alpha_0 \in [0,1)$, where Q is supported in 
a bounded interval, having a density function that is bounded away from zero on its 
support, and with a bounded derivative. If $\alpha_n > \alpha_0 + r_n/\sqrt{n}$ for some sequence $0 < r_n \to 
\infty$, then 

$$\sqrt{n}\, W_2(\mathcal{R}_{\alpha_n}(P_n), Q) \to 0 \quad\text{in probability as } n \to \infty.$$

Proof. Arguing as in the proof of Proposition 2 we can check that $Q \in \mathcal{R}_{\alpha_0}(P)$ is 
equivalent to $P = (1-\alpha_0)Q + \alpha_0 P'$ for some distribution $P'$. Hence, we can assume $X_n = 
(1-U_n)Y_n + U_n Z_n$, where $\{Y_n\}_n$, $\{Z_n\}_n$ and $\{U_n\}_n$ are independent i.i.d. sequences with 
laws Q, $P'$ and Bernoulli with mean $\alpha_0$, respectively. Write $N_n = \sum_{i=1}^n I(U_i = 1)$. Then 
$N_n$ follows a binomial distribution with parameters n and $\alpha_0$. Hence, $\sqrt{n}(N_n/n - \alpha_0) \to_w 
\sqrt{\alpha_0(1-\alpha_0)}\,Z$, with Z standard normal. We assume w.l.o.g. that convergence holds, 
in fact, a.s. Write $n' = n - N_n$, $X'_1,\dots,X'_{n'}$ for the $Y_i$'s in the sample with associated 
$U_i = 0$ (the uncontaminated fraction of the sample: $X'_1,\dots,X'_{n'}$ are i.i.d. Q) and $P_{n'}$ 
for the empirical measure on the $X'_i$'s. Observe that $P_{n'} \in \mathcal{R}_{\hat\alpha_n}(P_n)$ with $\hat\alpha_n = N_n/n$. 
Now we note that given $\alpha, \beta \in [0,1)$, if $Q \in \mathcal{R}_\alpha(P)$, then $\mathcal{R}_\beta(Q) \subset \mathcal{R}_{\alpha+\beta-\alpha\beta}(P)$. Hence, 
$\mathcal{R}_{\tilde\alpha_n}(P_{n'}) \subset \mathcal{R}_{\alpha_n}(P_n)$ for $\tilde\alpha_n = (\alpha_n - \hat\alpha_n)/(1 - \hat\alpha_n)$ provided $\alpha_n \ge \hat\alpha_n$, which eventually holds. 
Consequently, 

$$W_2(\mathcal{R}_{\alpha_n}(P_n), Q) \le W_2(\mathcal{R}_{\tilde\alpha_n}(P_{n'}), Q).$$

Thus, the result will follow if we prove it in the particular case P = Q and $\alpha_0 = 0$. 
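
The inclusion between sets of trimmings used in this reduction can be verified in one line. The derivation below is a sketch that assumes the usual density-ratio description of the set of trimmings, $\mathcal{R}_\alpha(P) = \{R \ll P : dR/dP \le 1/(1-\alpha)\}$, which is not restated in this section; here $\hat\alpha_n = N_n/n$ denotes the observed contaminated fraction and $\tilde\alpha_n$ the trimming level left for the uncontaminated subsample.

```latex
% Assumed definition: R_alpha(P) = { R << P : dR/dP <= 1/(1-alpha) P-a.s. }
\[
Q \in \mathcal{R}_\alpha(P),\quad R \in \mathcal{R}_\beta(Q)
\;\Longrightarrow\;
\frac{dR}{dP} = \frac{dR}{dQ}\,\frac{dQ}{dP}
\le \frac{1}{(1-\beta)(1-\alpha)}
= \frac{1}{1-(\alpha+\beta-\alpha\beta)},
\]
% so R belongs to R_{alpha + beta - alpha*beta}(P).
% Matching the combined level with alpha_n gives the residual level:
\[
\hat\alpha_n + \tilde\alpha_n\,(1-\hat\alpha_n) = \alpha_n
\;\Longleftrightarrow\;
\tilde\alpha_n = \frac{\alpha_n - \hat\alpha_n}{1 - \hat\alpha_n}.
\]
```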

We proceed in this case writing F and f for the distribution and density functions 
of P. Recalling the parametrization in (7) we have 

$$W_2^2(\mathcal{R}_{\alpha_n}(P_n), P) = \min_{h\in\mathcal{C}_{\alpha_n}} W_2^2((P_n)_h, P) = \min_{h\in\mathcal{C}_{\alpha_n}} \int_0^1 \big(F_n^{-1}(h^{-1}(t)) - F^{-1}(t)\big)^2\,dt$$

and we see that $nW_2^2(\mathcal{R}_{\alpha_n}(P_n), P) = \min_{h\in\mathcal{C}_{\alpha_n}} M_n(h)$, where 

$$M_n(h) = \int_0^1 \Big(\frac{\rho_n(t)}{f(F^{-1}(t))} - \sqrt{n}\big(F^{-1}(h(t)) - F^{-1}(t)\big)\Big)^2 h'(t)\,dt$$

and $\rho_n(t) = \sqrt{n}\, f(F^{-1}(t))\,(F_n^{-1}(t) - F^{-1}(t))$ is the weighted quantile process. Without 
loss of generality, we can assume that $\{X_n\}_n$ are defined in a sufficiently rich probability 
space in which there exist Brownian bridges, $B_n$, satisfying 

$$n^{1/2-\nu} \sup_{1/n \le t \le 1-1/n} \frac{|\rho_n(t) - B_n(t)|}{(t(1-t))^{\nu}} = \begin{cases} O_P(\log n), & \text{if } \nu = 0,\\ O_P(1), & \text{if } 0 < \nu \le 1/2 \end{cases} \qquad (13)$$

(this is guaranteed by Theorem 6.2.1 in [8]). Now, defining 

N n (h) = f (jjj^ejj - V^(F-\h(t)) F- 1 ^ h'(t) dt, 
and assuming w.l.o.g. that a n < 1 — 8 for some 6 > we have that 

sup ,«.(*)* - *. W V>, < ( J £ ) 2 d.) " 2 - o P (D. 

The last equality follows from (13), taking v = 0, because, since / is bounded below 
Pnd) - B n (t) ^ 2 ^ logn ^ 1 di()p(l) = oHi) _ 



Thus, the conclusion will follow if we show min? ie c Qji N n {h) — > in probability or, cquiv- 
alently, if we show that min/igc N n (h) — > in probability, where 

N n (h) = f Q [ f{ p-\ t)) ~ V^(F-\h(t)) F-\t)^j 2 h'(i) dt 

and B is a fixed Brownian bridge. To check that minzj 6 c a N n (h) — > in probability, we 
observe that min/ ie c Qjl N n (h) < iminfegg^ R n (k), where 

Rn{k) = J 1 [ f{ p-i (t)) - MF~\t + k(t)/y/n) F-\t))^j 2 dt 

and Q n is the set of real-valued, absolutely continuous functions on [0, 1] such that fc(0) = 
fc(l) = and — y/n < k'(t) < r n for almost every t. We assume w.l.o.g. r„ < r n+ i for 
every n. Then Q n C Gn+i for every n and Q := U„>i Sn is the set of all absolutely 
continuous functions on [0,1] such that fc(0) = k(l) = and k' is (essentially) bounded. 
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From our hypotheses it follows easily that, for k G 

and hence minfegg„ R n {k) — > (therefore nyV|(7?. Qn (P n ),P) —> 0) will follow if we show 
that inffcgg R(k) = 0. But this can be checked easily by noting, for instance, that if k n 
is the function that interpolates B(t) at knots i/n, i = 0, . . .,n, and is linear in between, 
then we have k n £Q and R(k n ) — > 0. □ 

Proof of Theorem 2. We write $\alpha_0 = d_{TV}(P,Q)$ and take $P_0$ as in the canonical 
decomposition in Proposition 2 (we take $\mu$ to be the Lebesgue measure there). Then 
$P_0 \in \mathcal{R}_{\alpha_0}(P)$ holds, with P and $P_0$ playing the roles of P and Q, and the density of $P_0$ 
satisfies the assumptions in Theorem 5 (in fact $f_0 = (f \wedge g)/(1-\alpha_0)$ has a bounded 
derivative a.e., but this suffices for the strong approximation in the proof of Theorem 5). 
Hence, $\sqrt{n}\,W_2(\mathcal{R}_{\alpha_n}(P_n), P_0) \to 0$ in probability and similarly for $\sqrt{n}\,W_2(\mathcal{R}_{\alpha_n}(Q_n), P_0)$. 
The triangle inequality for $W_2$ yields the conclusion. □ 



A.2. Asymptotic theory for the bootstrap 

The behavior of the bootstrap p- value under the alternative follows from the next result. 

Proposition 4. Assume $X_{n,1},\dots,X_{n,n'}; Y_{n,1},\dots,Y_{n,m'}$ are i.i.d. random variables with 
common distribution $P_n \in \mathcal{F}_2$ such that $W_2(P_n, P) \to 0$. If $P^*_{n'}$ and $Q^*_{m'}$ denote the 
empirical measures on $X_{n,1},\dots,X_{n,n'}$ and $Y_{n,1},\dots,Y_{n,m'}$, respectively, and $n', m' \to \infty$, 
then 

$$W_2(P^*_{n'}, Q^*_{m'}) \to 0 \quad\text{in probability.}$$

Proof. By Proposition 3 it is enough to consider the case $P_n = P$ for all n. But then 
$P^*_{n'} \to_w P$ a.s. by the Glivenko-Cantelli theorem while the law of large numbers gives 
convergence of second-order moments. These two facts imply that $W_2(P^*_{n'}, P) \to 0$ (and 
the same holds for $W_2(Q^*_{m'}, P)$). □ 

Now we take care of the null hypothesis. The next result will be useful for P and Q 
away from the boundary. Its proof is analogous to that of Theorem 2.1 in [4]. 

Proposition 5. Assume $X_{n,1},\dots,X_{n,n'}$ are i.i.d. random variables with common 
distribution $P_n \in \mathcal{F}_2$ such that $W_2(P_n, P) \to 0$. If $\bar X_{n,n'} := \frac{1}{n'}\sum_{i=1}^{n'} X_{n,i}$, then 

$$\sqrt{n'}\,(\bar X_{n,n'} - \mu_n) \to_w N(0, \sigma^2),$$

where $\mu_n = E(\bar X_{n,n'})$ and $\sigma^2$ is the variance of P. 
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Proof of Theorem 3. We will assume for simplicity n = m and n' = m'. The gen- 
eral case can be handled with straightforward modifications. We consider first the 
case cItv{P,Q) > ol. In this case wc have (Theorem 1) that W 2 (P niQI „, Pa) — > and 
y^2{Qn,a n ,Qa) -> a.s. Since 

W 2 2 (aPi + (1 - a)P 2 ,aQ 1 + (1 - a)Q 2 ) < oWf (Pi.Qi) + (1 - a)W 2 2 (P 2 ,g 2 ) 

for probabilities Pi,Qi G T% and a G [0,1] (see [2]) it follows that W 2 (i?„ ; „, XP a + (1 — 
A)Q a ) — > a.s. Note that 



Now, Theorem 1 implies that W 2 (P„, Q „,Qn, Q „) W 2 (7?. ce (P),72. Q ((3)) > 0, while n/n' is 
bounded away from by assumption. This, together with Proposition 4, gives (ii). 

We assume now that (Itv(P,Q) < a. Then Theorem 2 ensures that v / ^W 2 (P„. Qll , 
Qn,a n ) — > in probability. Now, if Pi,P 2 are probabilities in T<i with means fJ-i,(J-2 
and P\,p2 are their centered versions, then it is easy to check that W 2 (Pi,P 2 ) = 
(jx 1 -H2) 2 +yVl(P 1 ,P 2 ) and, therefore, Wf(Pi,P 2 ) > (fii~fi 2 ) 2 - Let X*, and ^respec- 
tively, denote the means corresponding to the X's and Y's bootstrap samples, and fi n 
be the mean of the parent bootstrap distribution, R n , n - Then 



From the Glivenko-Cantelli theorem we have a.s. tightness of $\{P_n\}_n$ and $\{Q_n\}_n$ and, 
as a consequence, of $P_{n,\alpha_n}$ and $Q_{n,\alpha_n}$ (see Proposition 2.1 in [2]). We can assume, 
taking subsequences if necessary, that $P_{n,\alpha_n} \to_w P_0$ and $Q_{n,\alpha_n} \to_w Q_0$ for some 
probabilities $P_0, Q_0$. A little thought shows that, necessarily, $P_0 \in \mathcal{R}_\alpha(P)$ and $Q_0 \in \mathcal{R}_\alpha(Q)$. Since 
$W_2(P_{n,\alpha_n}, Q_{n,\alpha_n}) \to 0$, necessarily, $P_0 = Q_0 \in \mathcal{R}_\alpha(P) \cap \mathcal{R}_\alpha(Q)$. Also, since $P, Q \in \mathcal{F}_2$, 
the strong law of large numbers shows that the map $x^2$ is uniformly integrable with 
respect to $\{P_n\}_n$ and $\{Q_n\}_n$ a.s., hence also with respect to $\{P_{n,\alpha_n}\}_n$ and $\{Q_{n,\alpha_n}\}_n$. 
Thus, perhaps through subsequences, $W_2(P_{n,\alpha_n}, P_0) \to 0$ and $W_2(Q_{n,\alpha_n}, P_0) \to 0$, hence 
$W_2(R_{n,n}, P_0) \to 0$ for some $P_0 \in \mathcal{R}_\alpha(P) \cap \mathcal{R}_\alpha(Q)$. 

The function that sends P to its variance is continuous in T 2 for the W 2 metric. Hence, 
since 1Z a (P) (~MZ a (Q) is compact, the variance attains its minimum there. Let us write 
a"o = mm .Re7?. Q (P)nrc a ((2) Var(P). Then ctq > (a trimming of a probability with a density 
has a density, hence, cannot have null variance) and if we write a 2 for the variance of Po , 
we have 




n'W 2 {P* n ,Q* m )>n\X, 



Y:,) 2 = (Vri(X, 



Vri(Y:,-n n )) 2 . 
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Thus, Proposition 5 and the fact that y/nW2{Pn,a, n i Qn,a n ) —> yield that p* n n — > 1 in 
probability, showing (i). □ 

Proof of Theorem 4. As in the proof of Theorem 2, we assume that X n = (1 — U n )A n + 
U n B n , Y n = (1 - V n )C n + V n D n with {A n } n , {B n } n , {C n } n , {£>„}„, {[/„}„, {K}„ 
independent i.i.d. sequences of which {A„}„ and {C n } n have common distribution Pq 
while {U n } n and are Bernoulli with mean a. We write JV n = ~ 1) an d 

M n = X)iLi I(Vi = !)• Also we put = n — iV n , n' 2 = n — M n and write Ai, . . . , A^ and 
Yi,... ,Y n > for the data corresponding to Ui = and Vi = 0, respectively. 

On the set E n := (N n < na n ,M n < na„), the empirical measures on Ai, . . . , A n < and 
Yi , . . . , (which we denote P n > and ) satisfy P„/ € 7vL Qll (P„) and G 7?. Q „ (Q n ). 
Hence, we have W 2 (P„, Qll , Qn,a n ) < W 2 (P„< , Q^). Thus, 

F(K.n < j8) < F(^) + n(Pn < P) 

where 

& =p*(^ 7 w 2 (p:„q;') > Vn(i-«)w 2 (p„i,Q»i)). 

By the central limit theorem (CLT) we have P(2?„) — > 7. Hence it suffices to control 
IP((Pn < j8) H If Jij ■ • ■ , Jn'j L\, . . . , L n > arc i.i.d. random variables with law Pq, in- 
dependent of the data (both original and bootstrap) and pb n i , v n i are the empirical 
measures, then Theorem 3 and the fact that W 2 (C(aX),C(aY)) = aW 2 {C(X),C(Y)) for 
a > imply 



W 2 (r(Vn'W 2 (P„*,,Q;,))^(v / n'W 2 (^,^)))<2Vn'W 2 (P„,„,Po) 



By Lemma 1 below v fi'W 2 (-Rn,n, Po)Ie„ — > in probability. The assumptions on P and Q 
yield that \frt! W 2 (/i n / , v n < ) converges weakly to a non-null limiting distribution as in (9) 
(with a proof as in Theorem 4.6 in [10]). We call 77 the limit probability measure. Then 



\p* n - r)( ( y/n(l-a)W 2 (P n { , Qn' 2 ) , 00)) | I En -> 
in probability. As a consequence, 



p((p; <j9)n £„) - P((7 ? (( v /n(l-a)W 2 (P„ i , Q ni ),oo)) < /?) n £„) -> 0. 

But 



P( (77 ( ( vXl - a)W 2 (JP n /,Q n /),oo))</3)n^„) 



< P((t?(( Vn(l - a)W 2 (P ni , Q ni ),oo)) < /?)) -> /3, 



since, as above, ^/ n(l — a)W 2 (P„' i , Q n ^) converges weakly to 77. This completes the 
proof. □ 

The following technical result has been used in the proof of Theorem 4. 
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Lemma 1. With the notation and assumptions of Theorem 4, 

Proof. We use the parametrization in (7). We have P n ,a n = {Pn)h n , Qn,a n = (Qn)l n > f° r 
some h n , l n £ C Qn . Writing F" 1 , G" 1 , F" 1 and G^ 1 for the quantile functions of P n , Q n , 
P and Q we have W 2 (P„, a „ , Q„ ,,«„ ) = 1 1 F~ 1 o /i" 1 - G" 1 o l~ 1 j 1 2 , with 1 1 • 1 1 2 denoting the 
usual norm in L 2 (0, 1), namely, \\b\\ 2 = J b 2 . Now 

|| (F- 1 o h- 1 - G- 1 o C 1 ) - (F- 1 o fc" 1 - G- 1 o C 1 )^ 

< ll^n 1 ° C - ° K% + WG- 1 Q i" 1 - G- 1 o Z" 1 ^ 

<- ?T ^(||i^ 1 -J'- 1 || a + ||G ! » 1 -G- 1 || a ) > 

where wc have used that J 1 (F- 1 (h- 1 (t))-G- 1 (h- 1 (t)) 2 dt=J^(F- 1 (x)-G- 1 (x) 2 h'(x) dx. 
The assumptions on P and Q ensure that, as in (9), Hi* 1 " 1 — F _1 j| 2 + ||G^ X — G -1 ]^ = 
Op(n -1 / 2 ). On the other hand, on E n , 

| IF" 1 oh- 1 - G- 1 ol~% = W 2 (P n , an ,Q n , a J < W 2 (P n[ ,Qn>) = Opin- 1 / 2 ). 

Combining these two facts we see that W 2 (Ph n , Qh^Ie^ = || F 1- 1 o /i" 1 — G — 1 o Z~ 1 1| 2^^^ = 
Op(n" 1 / 2 ). Using (12) we see that W 2 (fJ,F ) = o"(n~ p/2 )- Since W 2 (P hn , P„, Q J = 
Op(n~ 1//2 ), we conclude that W 2 (P n ,a n , Pq)Ie„ — 0(n~ p / 2 ). Convexity and a similar 
argument for Q n ,a„ yield the result. □ 

Proof of Example 1. The fact that oItv(P,Q) = a follows from noting (with some 
abuse of notation) that for F" 1 G TZ a (P) and G" 1 € TZ a (Q) 

F _1 (t) < F~V + (1 - < G" 1 ^). 

Hence, the probability Pq with quantile F " 1 (<) = F _1 (a+ (1 — a)t) is the unique element 
in K a (P) n Tl a (Q). Next we observe that, for F" 1 G (F), 

F"!(i) <F" 1 K + (l-a„)t) 

< F^i) + (F- X K + (1 - «n» - F- X (a + (1 - a n )t)). 

Similarly, if G" 1 G K an (Q), G _1 (t) > F _1 (t) - (F- 1 (a n + (1 - a„)i) - F" 1 ^ + (1 - 
a n )i)) and, combining both inequalities, we get |F _1 (i) - F _1 (f)| < IF" 1 ^) - G _1 (i)| + 
|F _1 (a„ + (1 — a„)t) — F _1 (q + (1 — a„)t)\ and the bound follows from the triangle 
inequality. □ 

Proof of Example 2. We write Fq for the distribution function of Pq, hence, F ~ 1 (y) = 
At/2 + F" 1 ((l - a)y) for y G (0, 1/2] and F _1 (y) = -/i/2 + F~ x {a + (1 - for y G 
[1/2,1). Similarly, we write F„ and G„ for the distribution functions of P n and Q n , 
respectively. Necessarily, F„(0,oo) < - F(f )) = |(1 + We write /3„ = 
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§ - P n (0,oo). It follows from the fact that W 2 (P n ,Q„) -> that W 2 {P n ,Po) -> and, 
therefore, that f3 n — >• 0. We give next a lower bound for W 2 (P n ,Q n ), assuming that 
fi n > 0. If this is the case 

F-\t)<-^ + F- 1 (a + {l-a n ){t-p r A + ^\, tzU,±+p\ (14) 

On the other hand G~ 1 ((l — a n )t) > + P _1 ((l — a n )t). Standard computations show 
that there is a unique a = a(/3 n ) > such that F(a — 2 ) — + 2 ) + a = (1 — o)Pn and 
that 



2 

for t £ (yj^P(— a — |). From this we get that 



2 + F~\a + (1 - a)(t - /3)) < fi/2 + F~\(l - a)t) 



yV 2 (Pn,Qn) > V9l(Pn) ~ S„,l " S n , 2 , (15) 

where 5l (/3) = J^! a _ M/2)/(1 _ a) (M + ^((l - «)*) - P- J (« + (1 - «)(* - £))) 2 dt, 4,1 = 

/^ M / a) /( 1 -a)(^ 1 ((l-a)*)--F , - 1 ((l-a»)t)) a d* > 4, 2 = ^(! _, /2)/(1 _ a) (^- 1 (« + 
(1 - a)(t - p n )) - F- 1 (a + (1 - a n ){t - /?„) + 57s)) 2 di. A routine use of Taylor ex- 
pansions yields lim^ 0+ "0- = (1 - a) 3/2 ^ffl > °= <i = O^n" 1 ) and < 2 = 
0(v / /3^Vi~ 1 )- From this and (15) we obtain 

/} n = 0(n- 2 / 5 ), (16) 

with a similar bound being satisfied by j n = i — Q„(— oo,0). 

We turn now to the upper bound for W2(Pn,Po)- From the triangle inequality we get 

/ rl/2 \l/2 / ,1 x 1/2 

W 2 (P n ,Po)<^ (P^-Po" 1 )^ + {J 1/2 ^n 1 - F 1 ) 2 ) 

/ pl/2 \ 1/2 / ,1 N 1/2 

< W 2 (P n ,Q n ) + (j( (G- 1 P^ 1 ) 2 ] + yj i/2 {F- 1 Po" 1 ) 2 
We consider next /^(P^ 1 — Po" 1 ) 2 - Since P„ € lZ an (P) we have 

P,7 1 (t)<-| + p- 1 K + (i-a„)0, te(0,i). (17) 

Keeping the above notation for j3 n , assume first that /3 n <0. Then 

F-\t) > - | + P- 1 fa + (1 - a„)t + , f G Q, l) (18) 

(this follows upon noting that P~ 1 (|+) > and ^n X (*) = -F -1 ^ -1 ^))) growing 
with slope at least 1 — a n ). For i £ (i, 1), (17) and (18) still hold if we replace P" 1 
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by Fq 1 . Hence, in this case j^F" 1 - F^ 1 ) 2 < Jy 2 {F~ 1 {a n + (1 - a n )t) - F~ 1 {a n + 



2-^/n ; 

If f3 n > 0, then, arguing as above, we have 



F~\t) > -| + F- 1 (Y + (1 - a„)(i - /?„) + , i e Q +/3„, lY 



(19) 



while (14) holds in (0 ,\ + /?„). Now we use the bound 



2 

1 \ 1/2 / /.l/2+/3„ x 1/2 / „1 x 1/2 

^n 1 -^ 1 ) 2 < / (iT 1 -^ 1 ) 2 +(/ (F- 1 -^ 1 ) 2 

1/2 / \Jl/2 / \Jl/2+i3n 

and proceed as follows. For £ e (i + /3„, 1) (17) and (19) hold again after replacing F" 1 
by Fq 1 . This and the triangle inequality yield 

1/2 

(F-'-F^ 

l/2+/9 n 

/ /-I N, 1/2 

< / (F- 1 (a + (l-a)t)-F- 1 (a + (l-a)(t-/3„))) 2 df 



l/2+/3„ 



(20) 

2x1/2 y ' 



= V9i{Pn) +2s„ j3 . 

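For the interval $(\frac{1}{2}, \frac{1}{2} + \beta_n)$ we write $G^{-1}(t) = \frac{\mu}{2} + F^{-1}((1 - \alpha_n)t)$ (the minimal quantile function in $\mathcal{R}_{\alpha_n}(Q)$). Then

\[ \Bigl(\int_{1/2}^{1/2+\beta_n} (F_n^{-1} - F_0^{-1})^2\Bigr)^{1/2} \le \Bigl(\int_{1/2}^{1/2+\beta_n} (F_n^{-1} - G^{-1})^2\Bigr)^{1/2} + \Bigl(\int_{1/2}^{1/2+\beta_n} (G^{-1} - F_0^{-1})^2\Bigr)^{1/2}. \]

We observe now that $G_n^{-1}(t) \ge G^{-1}(t)$ and also that, for $t \in (\frac{1}{2}, \frac{1}{2} + \beta_n)$, $-\frac{\mu}{2} + F^{-1}(\alpha + (1 - \alpha)(t - \beta_n)) \le 0 \le \frac{\mu}{2} + F^{-1}((1 - \alpha)t)$. Combining these facts with (14) we obtain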

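\[
\begin{aligned}
|F_n^{-1}(t) - G^{-1}(t)| \le{} & |F_n^{-1}(t) - G_n^{-1}(t)| \\
& + |F^{-1}((1 - \alpha_n)t) - F^{-1}((1 - \alpha)t)| \\
& + \Bigl|F^{-1}\Bigl(\frac{\alpha}{2} + (1 - \alpha_n)(t - \beta_n) + \frac{\alpha_n}{2}\Bigr) - F^{-1}(\alpha + (1 - \alpha)(t - \beta_n))\Bigr|.
\end{aligned}
\]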



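As a consequence

\[
\begin{aligned}
\Bigl(\int_{1/2}^{1/2+\beta_n} (F_n^{-1} - F_0^{-1})^2\Bigr)^{1/2}
\le{} & W_2(P_n, Q_n) + \Bigl(\int_{1/2}^{1/2+\beta_n} \bigl(\mu + F^{-1}((1 - \alpha)t) - F^{-1}(\alpha + (1 - \alpha)t)\bigr)^2 \, dt\Bigr)^{1/2} \\
& + 2\Bigl(\int_{1/2}^{1/2+\beta_n} \bigl(F^{-1}((1 - \alpha_n)t) - F^{-1}((1 - \alpha)t)\bigr)^2 \, dt\Bigr)^{1/2} \\
& + \Bigl(\int_{1/2}^{1/2+\beta_n} \Bigl(F^{-1}\Bigl(\frac{\alpha}{2} + (1 - \alpha_n)(t - \beta_n) + \frac{\alpha_n}{2}\Bigr) - F^{-1}(\alpha + (1 - \alpha)(t - \beta_n))\Bigr)^2 \, dt\Bigr)^{1/2} \\
={} & W_2(P_n, Q_n) + \sqrt{g_3(\beta_n)} + 2 s_{n,4} + s_{n,5},
\end{aligned}
\]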

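where $g_3(\beta) = \int_{1/2}^{1/2+\beta} \bigl(\mu + F^{-1}((1 - \alpha)t) - F^{-1}(\alpha + (1 - \alpha)t)\bigr)^2 \, dt$. Again a Taylor expansion shows that $g_3(\beta_n) = O(\beta_n^3) = o(n^{-1})$. Similarly, we get $s_{n,j}^2 = o(n^{-1})$, $j = 4, 5$, and, as a consequence

\[ \Bigl(\int_{1/2}^{1/2+\beta_n} (F_n^{-1} - F_0^{-1})^2\Bigr)^{1/2} = O(n^{-1/2}). \tag{21} \]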

Collecting the estimates in (20) and (21), we obtain

\[ \Bigl(\int_{1/2}^{1} (F_n^{-1} - F_0^{-1})^2\Bigr)^{1/2} \le \sqrt{g_2(\beta_n)} + 2 s_{n,3} + O(n^{-1/2}). \tag{22} \]

We note next that $F^{-1}$ has a bounded derivative and, as a consequence, $s_{n,3}^2 = O(n^{-1})$. Similarly, we find that $g_2(\beta_n) = O(\beta_n^2)$. Summarizing,

\[ \Bigl(\int_{1/2}^{1} (F_n^{-1} - F_0^{-1})^2\Bigr)^{1/2} = O(n^{-2/5}). \]

A similar analysis works for $\int_0^{1/2} (G_n^{-1} - F_0^{-1})^2$ and completes the proof. □
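
The sign condition used above on the interval $(\frac{1}{2}, \frac{1}{2} + \beta_n)$ can be checked numerically. The following is a minimal sketch, not part of the proof, under assumptions that this passage does not state: $F$ is taken to be the standard normal distribution function and $\mu$ is chosen so that $F(\mu/2) = (1 + \alpha)/2$. The values of `alpha` and `beta` and the function names are hypothetical.

```python
from statistics import NormalDist

F = NormalDist()  # assumption of this sketch: F is the standard normal law


def p_max_quantile(t, alpha, mu, beta=0.0):
    """-mu/2 + F^{-1}(alpha + (1 - alpha)(t - beta))."""
    return -mu / 2 + F.inv_cdf(alpha + (1 - alpha) * (t - beta))


def q_min_quantile(t, alpha, mu):
    """mu/2 + F^{-1}((1 - alpha) t), the minimal quantile function in R_alpha(Q)."""
    return mu / 2 + F.inv_cdf((1 - alpha) * t)


alpha, beta = 0.1, 0.05                  # hypothetical values
mu = 2 * F.inv_cdf((1 + alpha) / 2)      # assumption: F(mu/2) = (1 + alpha)/2

# Interior grid points of the interval (1/2, 1/2 + beta).
grid = [0.5 + beta * k / 20 for k in range(1, 20)]
sign_condition_holds = all(
    p_max_quantile(t, alpha, mu, beta) < 0 < q_min_quantile(t, alpha, mu)
    for t in grid
)
print(sign_condition_holds)
```

Under these assumptions both quantile functions vanish at $t = \frac{1}{2}$ when $\beta = 0$, which is why the interval of interest starts at $\frac{1}{2}$.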

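Proof of Proposition 3. We take $(X_{1,1}, Y_{1,1})$ to be an optimal coupling for $P$ and $Q$ with respect to the $\|x - y\|^p$-cost and $(X_{1,i}, Y_{1,i})$, $2 \le i \le n$, and $(X_{2,j}, Y_{2,j})$, $1 \le j \le m$, independent copies of $(X_{1,1}, Y_{1,1})$ (hence $E\|X_{i,j} - Y_{i,j}\|^p = W_p^p(P, Q)$). Then $S_{n,m} = \min_\pi (a(\pi))^{1/p}$ and $T_{n,m} = \min_\pi (b(\pi))^{1/p}$, where

\[ a(\pi) = \sum_{1 \le i \le n,\, 1 \le j \le m} \pi_{i,j} \|X_{1,i} - X_{2,j}\|^p, \]

$b(\pi)$ is defined similarly by replacing $X_{i,j}$ by $Y_{i,j}$, and $\pi$ takes values in the set of $n \times m$ matrices with non-negative entries $\pi_{i,j}$ such that $\sum_{1 \le j \le m} \pi_{i,j} = \frac{1}{n}$ and $\sum_{1 \le i \le n} \pi_{i,j} = \frac{1}{m}$.

We observe next that, by the triangle inequality,

\[
\begin{aligned}
|a(\pi)^{1/p} - b(\pi)^{1/p}| &\le \Bigl(\sum_{1 \le i \le n,\, 1 \le j \le m} \pi_{i,j} \|(X_{1,i} - X_{2,j}) - (Y_{1,i} - Y_{2,j})\|^p\Bigr)^{1/p} \\
&\le \Bigl(\frac{1}{n} \sum_{1 \le i \le n} \|X_{1,i} - Y_{1,i}\|^p\Bigr)^{1/p} + \Bigl(\frac{1}{m} \sum_{1 \le j \le m} \|X_{2,j} - Y_{2,j}\|^p\Bigr)^{1/p}.
\end{aligned}
\]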
As a consequence, we have that $|S_{n,m} - T_{n,m}|$ is upper bounded by the right-hand side of the above display and, from the elementary inequality $(a + b)^p \le 2^{p-1} a^p + 2^{p-1} b^p$ for non-negative $a, b$, we get

\[
\begin{aligned}
E|S_{n,m} - T_{n,m}|^p &\le 2^{p-1} E\|X_{1,1} - Y_{1,1}\|^p + 2^{p-1} E\|X_{2,1} - Y_{2,1}\|^p \\
&= 2^p W_p^p(P, Q).
\end{aligned}
\]

This completes the proof. □ 
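
The pathwise bound behind this proof can be illustrated numerically. The sketch below is only an illustration under simplifying assumptions that are not in the text: dimension one, equal sample sizes $n = m$ (so that the optimal matching pairs order statistics and no linear program is needed), $P = N(0, 1)$, and $Q$ the law of `scale` times a $P$-distributed variable, for which the increasing map $x \mapsto$ `scale` $\cdot\, x$ gives an optimal coupling. All function names are hypothetical.

```python
import random


def empirical_wp(xs, ys, p=2.0):
    """W_p between two equal-size empirical measures on the real line.

    For n == m and p >= 1 an optimal matching pairs the order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles n == m")
    pairs = zip(sorted(xs), sorted(ys))
    return (sum(abs(x - y) ** p for x, y in pairs) / len(xs)) ** (1.0 / p)


def mean_cost(xs, ys, p=2.0):
    """(1/n * sum_i |x_i - y_i|^p)^(1/p) for samples paired index by index."""
    total = sum(abs(x - y) ** p for x, y in zip(xs, ys))
    return (total / len(xs)) ** (1.0 / p)


def check_bound(n=200, p=2.0, scale=2.0, seed=0):
    """Return (|S - T|, right-hand side of the pathwise bound)."""
    rng = random.Random(seed)
    # Two independent samples from P = N(0, 1).
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Coupled samples from Q: Y = scale * X (increasing map, scale > 0).
    y1 = [scale * x for x in x1]
    y2 = [scale * x for x in x2]
    s = empirical_wp(x1, x2, p)  # plays the role of S_{n,m}
    t = empirical_wp(y1, y2, p)  # plays the role of T_{n,m}
    rhs = mean_cost(x1, y1, p) + mean_cost(x2, y2, p)
    return abs(s - t), rhs


lhs, rhs = check_bound()
print(lhs <= rhs + 1e-12)
```

The inequality holds for every realization, not only on average, which is what allows the proof to pass to expectations with the elementary inequality for $(a + b)^p$.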